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ABSTRACT 


The new DFT algorithm of S. Winograd is developed and presented 
in detail. This is an algorithm which uses about of the number of 
multiplications used by the Cooley-Tukey algorithm and is applicable 
to any order which is a product of relatively prime factors from the 
following list: 2,3,4,5,7,8,9,16. The algorithm is presented in 

terms of a series of tableaus — one for each term in this list — 
which are convenient, compact, graphical representations of the 
sequence of arithmetic operations in the corresponding parts of the 
algorithm. Using these in conjunction with Tables 5-6 (pp. 80, 84), 
makes it relatively easy to apply the algorithm and evaluate its 
performance. 


The organization of the paper allows skipping a large part of it 
on a first reading (see p. 3). 
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I. Introduction 

Ever since the discovery of the FFT algorithm pQ, the following ques- 
tion must have been phrased in many minds: "Does the FFT algorithm represent the 

ultimate in the fast computation of the discrete Fourier transform (DFT) or is there 
a still faster algorithm yet to be discovered?" One answer was provided in 1968 
by Yavne QQ who showed that the number of multiplications could be halved while 
leaving the number of additions unchanged. More recently (1976), a significant 
step along this path was taken by S . Winograd who developed an algorithm 

which reduces the number of multiplications of the radix-2 FFT algorithm [jlTJ by 
a factor of about 5. This reduction is accompanied by a small increase or 

decrease in the number of additions. In most cases, the increase does not exceed 20%. 

Our basis for comparison both here and later on is the "nominal" performance 

of the Cooley-Tukey (FFT) algorithm, namely, the computation of an N-th order DFT 

lei 

of complex data with «y/£, T real multiplications and real additions where 

Li 

^ CT = 2N log 2 N; CT = 1.5 ^T ct (1.1) 




CT 


We adopt (1.1) as the basis for comparison for all N. 

Winograd' s algorithm then performs the above task using about 

J 

real multiplications. We devote the rest of this section to a description of the 
capabilities and constraints of the algorithm so that the reader could assess its 
suitability to his needs before delving any deeper. 

At its present state of development, the algorithm is applicable to any N 


satisfying 


» - n 


N, 


( 1 . 2 ) 


k=l 


in which the N, 's are relatively prime factors taken from the following list 


N fc = 2,3,4,5,7,8,9,16 


(1.3) 


The maximal N is therefore. 16’ 9 .7* 5 = 5040. All of the N values satisfying the 
above prescriptions are listed in the summary of the algorithm presented in Table 6 
(p. 84). The actual multiplication reduction factor is listed there for each N, 
under the heading G^. Note that the average of G ra for all N > 140 is about 5.5. 
with such a large reduction in the number of multiplications, it is quite obvious 
that the new algorithm will run faster than the Cooley-Tukey algorithm in most sys- 
t p.ms . To ma ke a more specific claim, we must know the basic system parameter y 


Eqn. (1.1) is adopted as a convenient yardstick. It should be borne m mind 
that in addition to Yavne' s algorithm, there are other FFT variants which are 
somewhat more efficient than (1.1). 



- 2 - 


which is the ratio of the time taken to execute one real multiplication to the time 
taken to execute one real addition. (The term "real" is used here as opposed to 
"complex"; not as opposed to "integer" in the Fortran language). For very large 
(microprocessors, software multipliers, etc.), the speed gain approaches 
asymptotically. For lower values of y, the gain would be smaller. Denoting the 
speed gain by G, it will be shown (pp. 85, 79) that 


G(y) = g ot 


y + 1 . 5 
y+R 


(1.4) 


where R is another parameter listed with G OT in Table 6. Obviously, it is now a 
trivial matter to compute the speed gain for any system and any permissible N. 

Since R > 1.5 for all practical N values, it is obvious that the advan- 
tage of this new algorithm diminishes with decreasing y. Yet, it is interesting 
to note that even in the extreme case of y = 1, the new algorithm is still the 
faster one for all N values of Table 6. 


The main disadvantage of the algorithm is its need for a large memory. 

We express this in terms of the parameter 1 4 ( of Table 6. For the processing 
,of complex data we have to have a storage array of 1.5^/ real words. *,(( varies 
from about 2N at the lower end of the table to about 4N at its upper end. Thus, 
for high N values, we require a storage array of size 6N. This is 4N more than 
the minimal requirements of 2N for storing the input vector. 

Of the total memory requirement of 1.5 real words, 0.5o^ are needed 
for the storage of precomputed constants. Since not all of these constants are 
distinct, it is probably feasible to reduce this part by the use of a more involved 
addressing scheme. 

Another probable disadvantage of the new algorithm is that, in comparison 
with the Cooley-Tukey algorithm, it might require more bits per word to maintain a 
prescribed level of precision. This effect is discussed in some more detail in 
section IX but the whole subject merits further study. 

The development of the algorithm consists of two distinct parts. In 
the first part, fast DFT algorithms for the low orders listed in (1.3) are 
developed as a set of building blocks. The second part introduces a combining 
algorithm which integrates groups of these building blocks into the desired final 
structures, namely, DFT algorithms for orders N prescribed by (1.2). 

The low-order algorithms of (1.3) are derivable from algorithms of orders 
2,4,6 of another type of transformation called LCT (Lef t-Circulant Transformation). 
Hence, the following structure of the paper: Section II is devoted to the DFT-LCT 

interrelationship. This is followed by the three LCT algorithms in Section III 
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and the seven DFT algorithms derived from them in Sections IV, V, VI. Section VII 
tackles the integration of these low order algorithms into the desired algorithm 
for order N satisfying (1.2). Section VIII is devoted to performance evaluation 
and Section IX concludes with an overview and some comments regarding "in-place" 
transformation. 

The treatment of the subject is detailed and relatively complete, aiming 
to provide a sound basis for further development of the subject. Naturally, this 
demands a substantial investment of time. It should be pointed out, however, that 
a reasonable grasp of the basic ideas and their application can be obtained by 
skipping the detailed derivations of the low-order algorithms. If this is 
acceptable, the following parts of the paper may prove sufficient: Section II, first 

part of Section III (up to the treatment of the left circulant of order 4 on p. 18) , 
the introduction of the n vector on p. 40, the tableau generalization in Section VI 
(portion bounded by eqns. (6.16), (6.17) on p. 49), Section VII, last part of 
Section VIII (following eqn. (8.9) on p. 83), and Section IX. 

We conclude this introduction with a few words regarding the circumstances 
that initiated work on the present paper. Winograd's description of his algorithm 
[3] is very short (1% pages) and, partly because of that, hard to follow. However, 
even with the necessary mathematical background and a full mastery of the paper, 
actual implementation would still be out of reach without a lot of additional hard 
work. This is partly due to the fact that the basic building blocks comprising the 
algorithm are only sketched in general outline — not actually constructed.^ It turns 
out that, for some of these, the transition from outline to end product is far from 
trivial. 

The work reported here was started as an effort to understand and be able 
to apply what appeared to be — and is in fact — a highly promising, fascinating, 
algorithm. It was in the course of this work, as the various difficulties were 
being overcome and the different pieces of the puzzle were beginning to form a coherent 
overall picture, that the prospect of sharing the knowledge thus acquired, began to 
have merit. There is no question that the absence of a fuller treatment of the 
algorithm constitutes a serious obstacle to its wider dissemination and use. This 
is where the present paper comes in. We present Winograd's algorithm in detail, we 
construct the required building blocks and we show how to incorporate them in the 
overall algorithm. In short, we provide the. missing links that would allow immediate 
implementation of the algorithm. 

Since the submission of this paper for publication, a summary of the algorithms did 
appear in print [10], though still without derivation. This helped to shed light on 
two points of discrepancy between Winograd's stated results [3] and the first version 
of this paper. For further details see footnotes 8 (p. 26) and 9 (p. 41). 
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II. Strategy Development 

The cornerstone of Winograd's algorithm is a theorem [jQ providing the 
solution to a seemingly unrelated problem: Given the two polynomials A(x) , B(x), 

what is the minimal number of multiplications required to compute 

(A(x)B(x)} mod C(x) (2.1) 

lc 

where C(x) is a given monic -polynomial. The connection with the DFT consists 
of two links: Firstly, the DFT matrix is shown to be related to another special 
transformation in which the transforming matrix is a lef t-circulant (exact defi- 
nition follows later). Secondly, the evaluation of this transformation is 
shown to be identical with the evaluation of (2.1). Thus, the minimization of 
the number of multiplications in (2.1) leads via the above two links to a mini- 
mization of the number of multiplications in the computation of the DFT. 

We proceed now with some required definitions. A Hankel matrix is one 
in which the value of element a^ is a function of (i+j). In such a matrix, 
one encounters identical elements as one moves along any diagonal sloping down 
and to the left. Obviously, the matrix is completely determined by its first row 
and last column. 

The matrix we will be concerned with here is a special case of a Hankel 
matrix, namely, a Hankel matrix for which (for order n and index range 0,1,..., 
n-1) 

a . = a n (l<p< n-1) (2.2) 

p ,n-l 0,p-l — — 

that is, the last column is a trivial rearrangement of the elements of the first 

row. Hence, this matrix is completely prescribed by its first row. Indeed, the 

second row is obtained from the first one by a circular left shift (element shifted 

out on the left, reappears on the right), the third is derived the same way from 

2 

the second and so on. We call such a matrix a lef t-circulant or ( equivalently ^ 


lc Tf C(x) is of degree n, it is said to be monic when the coefficient 
of x n is 1. 

2 

This is based on the term circulant which is commonly used to describe 
a matrix generated from its first row by circular right shifts. 
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an LC matrix. Similarly, the linear transformation effected by such a matrix 
will be referred to as an LCT (Left-Circulant-Transformation) . Finally, a 
matrix which is not LC but contains a submatrix which is, will be called a 
quasi-left-circulant matrix (QLC) . 

We turn now to the LCT-DFT link. As will become obvious later on, we 
should concern ourselves with a trivial modification of the DFT defined as follows: 

. 2tt 
_1 N 

W = e (2.3) 


N-l 

= W UV f y (u = 0,1 N-l) (2.4) 

v=0 

£2 in (2.4) is an arbitrary complex constant. When £2=1, (2.4) reduces to the 
standard DFT. 

Let N, the order of the DFT, satisfy 


N 



k < 


I" 

i 3 


(p prime; k integer) 
(p odd) 

(P = 2) 


(2.5) 


We proceed to show now that for such N, eqn. (2.4) can be brought into the form 
of a QLC matrix . The derivation follows QT] and is based on the number 
theoretic idea of a primitive root, g, the primitive root of N (satisfying (2.5)) 
is an integer whose integral powers (mod N) generate all integers in the interval 
(1,N) except multiples of p. 

The number of multiples of p in the above interval is 


N k-1 

P = P 

Therefore, the number of integers generated by g is 

k k-1 ... k-1 

n = p -p = (p-l)p 


( 2 . 6 ) 


(2.7) 


and we may say that the sequence 

{g P mod N} (p = 0,1, . . . ,n-l) (2.8) 

is just a permutation of those integers in the interval (1,N) which are not mul- 
tiples of p. 


This is true for a range wider than (2.5) 
for our purpose. 


However, (2.5) is sufficient 
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We use these ideas now to relabel the indices of (2.4) as follows. All 
indices which are not multiples of p will be represented as in (2.8). Speci- 
fically, denoting 


we define 



mod N 
mod N 


(P t O = 0,1, . . . ,n-l) 


(2.9) 


B = F ; b =f (p ,a = 0,1, . . .n-1) (2.10) 

p r a s K ’ 

The indices which are multiples of p are now used to define 

B, ... = F.; b... = f. (i mod p = 0) (2.11) 

(l) x* (l) i 


With these definitions we embark now on the elimination of F ,f from (2.4). 

u v 

In doing this, we split the summation of (2.4) into two parts. In the first 

part v = mp (m integer), that is, v mod p = 0. In the second part v = s. The 
4 

result is : 


where 


N-n-1 

B . = f -n y ^ tp 

(tp) tp /-/ 

m=0 

2 ^1 a 

b . . + n > W Pg b (t= 0,1, . 

(mp) a 

. . , N-n-1) 

(2.12) 

-o w 

n 

TD W> 

+ 

(p = 0,1, .. . ,n-l) 

(2.13) 

N-n-1 p 



b -n V w mpg b, , 
p (mp) 

(p = 0,1, .. . ,n^l) 

(2.14) 

n-1 



, . n E»^>b 

p a=0 ° 

(p = 0,1,. • • ,n-l) 

(2.15) 


The term (p+o) identifies (2.15) as a Hankel transformation, that is, if we 
write down (2.15) with p and O as the row and column indices, respectively, 
then the matrix transforming b into B' is a Hankel matrix. More than that, 
it is that special kind of a Hankel matrix referred to earlier as a left-circu- 
lant. To see this, note that in view of the LC condition (2.2) , our Hankel 


\je take here advantage of the fact that (2.3) ensures W^ m mOC ^ ^ = W™. 
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(2.16) 


(2.17) 

It is obvious that (2.16) is indeed true and (2.15) is an LCT. We conclude that the 
permutations (2.9)-(2.11) will transform any DFT matrix of order N prescribed by 
(2.5), into a QLC matrix whose LC portion is of order n. 

Of particular interest is the special case in which N is prime, that is, 

k=l; N=p, so that (2.7) yields 


matrix would be an LC if 


p+n-1 p-1 , „ 

g = g mod N 


But since the primitive root of N always satisfies' 


n 


g = 1 mod N 


n = N-l (2.18) 

and the sequence (2.8) is just a permutation of the integers l,2,...,n. In 
this case, eqns. (2.12), (2.14) simplify as follows: 



/ nzl v 


B (0) 

-4(0) + £ b a) 

(2.19) 


a=0 


B 

P 

~ (p = 0,1 , . . . ,n-l) 

(2.20) 


To illustrate these ideas, consider the case of N=7 . The least positive 
primitive root of 7 is 3 QQ . Indeed, direct computation yields 


P 

0 

1 

2 

3 

4 

5 

3 P mod 7 

1 

3 

2 

6 

4 

5 


Applying (2. 9) — ( 2 . 11) , we get the following form 


( 2 . 21 ) 


^In the standard treatment of primitive roots, (2.8) is usually replaced by 
{g P mod N} (p=l,2, . . . ,n) (2.8'). 

Congruence (2.17) shows that the two formulations are equivalent. To prove (2.17) 
assume g m = 1 mod N for m<n. It follows then that g ra+ ^ = g^ mod N and thus two of 
the terms in sequence (2.8 ) are equal. Hence the contradiction that g is not a 
primitive root. Conclusion: m = n. 
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The LC structure is quite apparent here. 

So far, we have established the first link to problem (2.1) for N satis- 
fying (2.5). This constraint on N will be relaxed later on. We turn now to the 
second link, namely, showing that evaluation of an LCT is equivalent to evaluation 
of (2.1). We intend to establish this equivalence and, from it, derive algorithms 
for some general low-order LC transformations. It should be pointed out, however, 
that the LC matrix we are concerned with is one obtained by permuting a DFT submatrix 
and, as such, is still a function of only one variable (W) whereas the general LC 
matrix of order n is a function of n variables. This suggests further simpli- 
fications in our case which will indeed be realized later on. 

The matrix multiplication we are considering is shown in (2.23) where 
the LC pattern is clearly visible. 
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We introduce now the auxiliary polynomials. (NOTE: 
cates its degree.) 


A (x) 
m 


m 


-E 


a^x 


The polynomial subscript indi- 


(2.24) 


m 


V x) = S b i xl 

i=0 


(2.25) 


m . 

T (x) = t.x 
m / j x 


i=0 


(2.26) 


Consider the product polynomial 


2m 


V 9 (x)=A (x) B (x) = y v.x* 
2m mm / * x 

i=0 


(2.27) 


It is easy to see that the coefficients of this polynomial are obtainable by the 
matrix product (2.28). 



b 


0 


m-2 


u i 

m-1 


b 

m 


(2.28) 
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Comparing this to (2.23), we see that interchanging the two indicated triangular 
sections in (2.28) will transform the matrix of (2.28) into a trivial augmentation 
of the matrix of (2.23). This fact can be translated to the following relation- 
ship between T (x) and V_ (x): 
m 2m 

T m (x) * V 2m (x) - (a m b l +a m-l b 2 + --- +a lV (x "‘’ 1 - 1> 

- <a m b 2 + -'- +a 2 b n ) xU"* 1 -!) 


- (a b 1 +a n b ) x m ^(x^^-l) 
. m m-1 m-1 m 


, m-1 , mfl , s 

- a b x (x -1) 
m m 


V x) - V 2 m < x > - (x”* 1 -!) F m .l (x > 

where F (x) is the indicated polynomial of degree (m-1) . Hence 


(2.29) 

(2.30) 


V 2n, ( x> 


mfl , 
x -1 


m-1 


(x) + 


T (x) 
m 

mfl , 
x -1 


(2.31) 


Note that, in the quotient on the right, the denominator degree is higher than 
the numerator degree. This means that T (x) is just the remainder obtained 
when dividing V 2m^ X ^ ^ (x ra "*"^-l) . In other words, 

T (x) = {A (x) B (x)} mod (x n -l) (2.32) 
m mm 


We have made use here of the fact that the order of the LC matrix in (2.23) is 

n = mfl (2.33) 

Eqn. (2.32) is a prescription for performing the LCT of (2.23) through poly- 
nomial manipulations identical with those of (2.1). This, then, establishes 
the second link. 

Consider now the number of multiplications required to evaluate (2.23). 

2 

Straight matrix multiplication requires n scalar multiplications. A much 
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lower value is prescribed by Winograd's theorem QV] . In its narrower appli- 
cation to the present case, the theorem states that if (x n -l) is representable 
as 


k(n) 

x n -l = ]-[ m^x) (2.34) 

i=l 


where the m^x) are distinct polynomials irreducible over the field of rationals, 
then the minimal number of multiplications is (2n-k) , provided multiplications by 
rational numbers are not counted. 

The exclusion of rational multiplications merits an explanation. Suppose 

we have a minimal realization of T (x) of the following form 

m 

R j 

V x > ■ £(r) V x) <2 - 35 > 

r=l r 


where J^., are integers and in which F r (x) involves no rational multipli- 
cations. According to the theorem, the F^'s will require a total of (2n-k) 
multiplications and we are essentially being told that the additional R rational 
multiplications appearing in (2.35) do not count. To see what is involved here, 
let us clear fractions in (2.35). Let K be the least common denominator and let 


Hence, 



(2.36) 


R 


K T 


m 


M - Y, 

r=l 


F (x) 
r 


(2.37) 


Each of the multiplications by can be implemented as J^-l additions so 

that KT m (x) does not require any multiplications above the (2n-k) used in 
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computing the F^'s. Finally, we compensate for the multiplication by K on 
the left, by prescaling the a matrix, that is, replacing a^, A^(x) by 

m 

S i'T ! V-' " I 3 / (2 - 38) 

i=0 

Thus, (2.32) is now replaced by 

T (x) = K{X (x) B (x)} mod (x n -l) (2.39) 

m mm 


and we see that multiplications by rationals can always be eliminated without an 
increase in the number of irrational multiplications. 

It should be pointed out that while this argument is theoretically 
sound, practically, one should be concerned with the cost in terms of the extra 
additions introduced to eliminate the rational multiplications. Obviously, when 
these extra additions take longer than the multiplications they replace, we 
would be better off leaving the multiplications in. This will, indeed, be the 
case when n is large. Therefore, the algorithm taking advantage of Winograd's 
theorem, has to be constructed in such a way that a DFT of large order is broken 
down into many LC transformations of low order. As a matter of fact, all the 
DFT orders appearing in Table 6 (N max = 5040) call for just three LC transfor- 
mations of orders 2, 4, 6. 

The factoring of (x n -l) for these three cases is shown in Table 1 in 
which the last column gives the minimal number of multiplications as stated by 
Winograd's theorem. In the next section we develop the specific algorithms which 
realize these minima. These three algorithms serve as a foundation for the sub- 
sequent construction of DFT algorithms for all orders listed in (1.3). 



-13- 


III . The Basic LCT Algorithms 

Our approach here is to present the general method first in sufficient 
detail so that its application to the three specific n values can be subsequently 
presented as a mostly self-explanatory sequence of equations. 

The starting point is (2.39) in which K is left indeterminate till the 
very end of the derivation. The factoring of x n -l is spelled out in Table 1 
which identifies the nu(x) factors of (2.34). With the iru's available, we 
evaluate T^(x) in a two-phase scheme based on the polynomial version of the Chinese 
Remainder Theorem [7]. In phase 1 we compute 


u. (x) = {A (x) B (x)} mod m (x) 
l mm l 


(3.2) 


for-vall i of (2.34). This is based entirely on results established in the Appendix 
and summarized in Tables Al, A2 there. In phase 2 we use the polynomial version 
of Garner's algorithm \jT\ to construct T m (x) from the u^'s. This calls for 
the auxiliary functions c„ (x) , v^(x) introduced below. Their utilization in 
the construction of T m (x) is spelled out in (3.6). 


£nu(x) c„ (x)J mod nu (x) = 1 (definition of c^(x)) 


v x (x) = u x (x) 


(3.3) 

(3.4) 


V ± (x) 


9 • • • 9 


{ (. . .^p(u i (x)-v 1 (x))c 1 i (x)-v 2 (x)Jc 2 i (x)-v 3 (x^c 3 ± (x)-, 

-v. (x)) c._, . (x)} modm.(x) 


k(m-l) i-1 


T (x) 
m 


! m ^ ^ I - -*- 

v. (x) + v -(x) n 

i=2 j=l 3 


(k from Table 1) 


(3.5) 

(3.6) 


The computation of c (x) 
m^ (x) = x-x . From (A. 5) 


is trivial when 
in the Appendix, 


nu (x) is of degree 1, that is, 
we see that, in this case. 
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Table 1 

Rational Factorization of x n -l 


(x-1) (x+1) 

(x-1) (x+1) (x 2 +l) 

(x-1) (x+1) (x 2 +x+l) (x 2 -x+l) 
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m. (x. ) c . . (x . ) = 1 
i 1 13 1 


(3.7) 


Now, any c „ (x) 
lowest degree we 


satisfying (3.3) 
get 


(and hence (3.7)) would do. 


Choosing the 


c . . (x) = c . . 
ij 13 


B i<V 


(3.8) 


With these derivation outlines spelled out, we turn now to specific cases. To 
establish the evolving pattern, we follow these outlines even in the low order 
case (n=2) where direct derivation could be simpler. 


Lef t-Circulant Transformation of Order 2 (Fig. 1) 

"b. 


t. 


1 

- 

t„ 


_ 0 _ 



a i a o 


a o a l 


0 


a. 

1 


V x) = Ev 1; B i (x) 

i=0 


IV 

i=0 


> (3.9) 


Phase 1 


T^(x) = K{A 1 (x) B^(x) } mod Qx-1) (x+1)] 


m l m 2 


u 1 (x) = {A^(x) B^ (x) } mod (x-1) 


= (aQ+ap (bQ+bj^) (see (A. 5)) 


«1 H 


6 i " u i ' “A 


U 2 (x) = {A^(x) B^(x)} mod (x+1) 


= ( W ( VV 

a 2 


(3.10) 


6 2 = u 2 = a 2 e 2 
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Phase 2 


c ( X ) = i = _i 

12 W m;L (x 2 ) m^-l) 


1 

2 


V 1 = 6 1 


v 2 = ^ mod ^ x+1 ^ = 


T^x) = K 


V f (6 l“ (S 2 ) (X_1) 


■ rt'w 


X + 


2 (^1+^2) 


The obvious choice here is K=2 so that the final result is 


c o = W c i = 5 r 6 2 


(3.11) 


The algorithm is summarized graphically in the tableau of Figure 1. The conven- 
tions adopted here are quite simple and are also the ones adopted in the more com- 
plex tableaus presented later on. We defined 3,. as a linear combination of 

the 

J i 

" ' ‘ • ' ' ■ " 's 


b. 's. 

J 

Similarly, we found that t. is a linear combination of the fi.'s. The t 

i J i 

column lists the coefficients of this linear combination. The left corner arrow 


The 3. row lists the (non-zero) coefficients of this linear combination. 

t , 


indicates that the B.'s are derived from the b. s and not the other way 

1 J 

around. Usually there is no directional ambiguity so that the arrows may be 
omitted. The equations 6^ = B^ct^ appear explicitly in the tableau using the 
Fortran multiplication symbol. 

Finally, recall that the basic eqn. (2.39) is fully symmetric with 
respect to {a^},{b^}. We have taken advantage of this in the terminology intro- 
duced here. Thus, coupled with each 3^ which is a specific function of (b^ } 
(say, 4>^((b_.})) spelled out in the tableau, there is the variable which is 

exactly the same function of {a^}, that is. 


“i ' *i(|f|) 


This convention is adhered to throughout the paper.. Practically, this means that 



Gl 

ES 

a 

a 

D 

m 

D 

D 

m 


E 

H 

El 

m 

D 

D 

0 

D 

D 


£i«*i ({bj) 
n = 2 (2Mi 4A) 


Fig. 1. Algorithm for LCT of order 2 
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a. 

the left part of the tableau has to be run through twice. First, with 

K 

replacing {b } and thus yielding {a^} instead of then we run through the 

full tableau with {b.} as input. Note, however, that in spite of the mathematical 
a^ J 

symmetry between {-r~-} and {b,}, practically, there is an important difference 

K j 

between them, a is considered a constant matrix transforming a number of dif- 
ferent data vectors b. Therefore, the a^'s may be precomputed once and for 
all, their computation being ignored in accounting for the cost of transforming 
one data vector b. With this in mind, we count only the number of explicit 
arithmetic operations in the tableau, arriving at 2 multiplications and 4 addi- 
tions, for which we adopt the designation (2M; 4A) appearing in the figure. 

Note that the 2 multiplications are the minimum prescribed by Winograd's theorem 
(right column in Table 1) . 

Left Circulant Transformation of Order 4 (Fig. 2) 

In this and all other tableaus derived in this paper, it is suggested 
that the reader consider the tableau as he follows its derivation, noting the 
graphical representation of each mathematical statement as the algorithm evolves. 


to 


a„ a 0 a, a„ 


b 

3 


3 2 10 


0 

to 


a^ a, a„ 


b„ 

2 


2 10 3 


1 

C 1 


a l a 0 a 3 a 2 


b 2 

c o 


a 0 a 3 a 2 a l 


b 3 


a i - T- 


A 3 (x) 


- ±v 

i=0 

3 

b 3< x > = ]C b i xl 


i=0 


(3.13) 


Phase 1 


T 3 (x) = K (A 3 (x)B (x)} mod{ (x 2 +l) (x+1) (x-1) } (3.14) 


m„ 


m„ 


u s(x) - A 3 (1)B 3 (1) - (aQ+a^+a2+a 3 ) (bQ+b^+b 3 +b 3 ) 

X 


a. 


C 1 = W c 2 = b l +b 3 


B 1 = c i +c 2 


= u 3 = a 1 B 1 
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Fig. 2. Algorithm for LCT of order 4 
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u 2 (x) = A 3 (-1)B 3 (-1) = (a Q -a 1 +a 2 -a 3 ) (b Q -b 1 +b 2 -b 3 ) 


cu 


. . 3 c ± c 2 


6 2 u 2 a 2 3 2 

u^(x) is evaluated in two steps: 

A 3 (x) mod (x 2 +l) = (£^-£3 )x +(£q-£ 9) 

03 04 

B 3 (x) mod (x 2 +l) = (b 1 ~b 3 )x +(b Q -b 2 ) 


3 3 3 4 


(from Table Al) 


(3.15) 


Not 


e that the tableau of Fig. 2 defines 3 3 »3 4 indirectly in terms of 


Thus, 


c„ = b -b ; c, = b -b 
3 13 4 0 2 


V = c 3 ; B 4 = c 4 


so that no arithmetic operations are involved in the last two equations. When- 
ever this is the case, we circle the relevant terms in the tableau to stress 
this fact. 

Now we combine the two results of (3.15) using Table A2 (a 3 = p^; 33= q^; 
etc.) getting 

V*) = E^ a 4 _a 3^4 + a 4^3 + ^4l] x+ &4^3 + B4)-(a 3 +a 4 )3 3 I] 

We introduce now 

o 5 = a 3 +o 4 ; 3 5 = 3 3 +B 4 = c 3 +c 4 

63 — ^ ^-*5 ^3 ’ ^4 — 2(04-03)34? (>3 — — 2 o 4 3 3 


u 1 (x) = - y(5 4 +6 5 ) x - -j^5 + <5 3 ) 


Hence, 
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Phase 2 


11 1 
n i C, » 


1 = A. =■ 1 

/I 1 


'12 m 1 (x 2 ) tn^-l) 2’ "13 m 1 (x 3 ) m^l) 2’ "23 m 2 (x 3 ) m (1) 2 


v (x) = u^x) = - Y(e^x+e 3 ) 


v 2 = ( (<$ 2 + ^4 X "*■ ^3 )"j} mod (x+l) = |-(26 2 -e^+e 3 ) 
v 3 mod (x-1) = |‘( 5 1 - 5 2 +e 4^ 


T 3 (x) = -|(e 4 x+e3 )+| (26 2 -e 4 +e 3 ) (x 2 +l) + ?( 6 r 6 2 +e 4 ) (x 3 +x 2 +x+l) 


Adopting 


K = 4 


T 3 (k) 




Introducing now 

we get the final result 


e l = 6 r 6 2 5 e 2 = 6 l +6 2 


T 3 (x) = (e jL +e 4 )x 3 +(e 2 +e 3 )x 2 +(e 1 -e 4 )x +(e 2 -e 3 ) 


(3.16) 


A count of arithmetic operations in the tableau (excluding the a, manipulations) 

^ -i- 

yields 5 multiplications and 15 additions , again realizing the multiplication 
minimum of Table 1. 














a 5 

a 4 

a 3 

a 2 

a l 

a o 


b o 

ft 


a 4 

a 3 

a 2 

a l 

a o 

a 5 


b l 

t 3 


a 3 

a 2 

a l 

a o 

a 5 

a 4 


b 2 

t 2 

= 

a 2 

a l 

a 

0 

a 5 

a 4 

a 3 


b 3 

rt 

h— * 


a l 

a o 

a 5 

a 4 

a 3 

a 2 


b 4 

t o 


a o 

a 5 

a 4 

a 3 

a 2 

a l 


b 5 




a . 

~ x 
a i “ K 

5 

A (x) = 7!a x 1 
3 i=0 

5 

B 5 (x) = 

i=0 


(3.17) 


The author wishes to acknowledge here the help of Dr. R. G. Lipes of 
JPL in reducing the number of additions from 16 to 15. 
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Phase 


u x (x) , 

B 


T^(x) = K{A^(x)B^(x) } mod{ (x^-x+1) (x^+x+1) (x+1) (x-1) } 


'1 


m 2 m 3 m 4 


(3.18) 


u 4 = A 5 (1)B 5 (1) -» a. 



C l = b 3 +b 0 ;C 2 = b 5 +b 2 ’ C 3 = V b l 


U 4 = 



11=0 

a. 


aj ( c i +c 2 +c 3^ 


5 1 U 4 a i 6 l 


u 3 = A 5 (-1)B 5 (-1) = ^“^o + ^l _ ^2 + ^3 - ^4 + ^5^v~ bo+b l"" b2+b3 "" bi!f+b5 ^ 


~ v ~ 

a 6 


C 6 = W C 5 =b 5 _b 2 ’ C 4 = V b l 


u 3 = a 6 (c 6 +C 5" C 4 ) 


6 = u„ = a p. 

6 3 6 6 


u 2 (x) are evaluated in two steps 


<-(x) mod (x^+x+1) = ) — (b^+b^)^! x+ 0' B 3 + ^o^ - ^ b 5 +b 2^ (f rom Table Al) 


c 3 

£ 

3. 


c 2 


C 1 


= (c 3 -c 2 )x+ ( C;l -c 2 ) = 3 2 x+B 3 


Ac-(x) mod(x +x+l) = c^x+a^ 
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u 2 (x) = (A 5 (x)B 5 (x)} mod (x 2 +x+l) = Ja 3 B 3 -(a 3 -oi 2 ) (B 3 ~B 2 )J X + ^3^3 -a 2^2^J (3.19) 

(from Table A2) 


We introduce now: 


3 7 = b 3 -3 2 = c r c 2 + v c 3 = c r c 3 ] 


a 7 a 3 a 2 


and use these to transform (3.19) into 


u 2 (x) = (a 2 B 3 +a 7 8 2 )x+ (a 7 B 2 +a 3 8 7 ) 


-e. 


c 4 c 6 

= “ ^ c 5 +c 4 ^ x_ ( c 6 -c 5 ^ = _e 5 x_e 4 


CS 


(3.20) 


(3.21) 


B (x) mod (x 2 -x+l) = - [Tb 5 -b 2 )+(b 4 -b 1 )]x- Jj;b 3 -b 0 ) - (b 5 ~b 2 Q (from Table Al) 


Ai-(x) mod (x -x+ 1 ) = -a 5 x-a 4 

u 3 (x) = (A 5 (x)B 5 (x)} mod(x 2 -x+l) = 0(« 4 + “ 5 ^ ^ 4 + ^ 5) -a 4 ^4 J* + ^“4 ^*4 ““ 5 ^ (3.22) 

(from Table A2) 


Introducing 

3 s = e 5 +B 4 = c 5 +c 4 +c 6 -c 5 = c 4 +c 6 | 

a 8 = W ) 

we transform (3.22) into^ 

(x) — (ot^ ^5 ^ x ( a 4^8 -0t 8^5^ 

Phase 2 


(3.23) 


(3.24) 


(m 1 (x)c 12 (x)} mod m 2 (x) = 1 


Let c 12 (x). = y 0 

.*. m 1 (x)c 12 (x) = (x 2 -x+l)( Yl x+ Y 0 > = Y 1 * 3+ (Y 0 -Y 1 )* 2 +(Y 1 -Y 0 >s+ Y 0 


(Y l“W Y l )x+ ( WV Y 1 ) = 1 

Y 1 = Y o = r C 12 (X) = l (x+1) 


(from Table Al) 


^The e^'s defined here, do not appear in the final tableau and are used only in the 
intermediate steps. 
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c 


13 


c 


14 


v 


1 


V 


2 


v 


2 


v 


2 


v 


3 


v 


4 


T 5 (x) 


1 = 1 _ 1 __ 1 = 1 = 
mi (x 3 ) m.^-1) 3’ °23 m 2 (x 3 ) m (-1) 1 

= — = 1* C = — = — = i. c = ji _ ^ _ Ji 

m;L (x 4 ) m^l) C 24 m 2 (x 4 ) m 2 (l) 3’ C 34 m 3 (x 4 ) m (1) “ 2 

e 4 X+e 5 


1 2 

|j[-e 3 x + e 2 ~ e 4 x - (x+l)[{ mod(x +x+l) 


1 

2 


-(e 4 +e 3 )x +(-e 4 -e 3 +e 2 -e 5 )x +(e 2 ~e 5 ) 


\ ]je 2 -e 5 )x +(e 2 +e 3 +e 4 -e 3 ) 


2 ,c 3 c 4 5 J 


mod(x +x+l) 
(from Table Al) 


{(<S 6 e 4 x e 5 ) 3 2 


(e 2 -e 5 )x +(e 2 +e 3 +e 4 -e 5 ) 


}mod (x+1) 


^(-3e 3 -e 4 -2e 5+ 2« 6 ) 


^{^( | 5 1 -e 4 x-e 5 ) --|[(e 2 -e 5 )x+(e 2 +e 3 +e 4 -e 5 )I ^ -j- -|(-3e 3 -e 4 -2e 5 +26 g ) }mod(x-l) 


6 (<5 r e 2 +e 3- e 4 +e 5' 6 6 ) 


,K 


( e 2- e 5 ) x +( e 2 +e 3 +e 4 -e 5) 


K(e 4 x+e 5 )+ 2 

+ ^(5 3 -e 2 +e 3 -e 4 +e 3 -6^) (x“’+x 4 +x J +x' i +x+l) 


(x^-x+l)+^(-3e 3 -e 4 -2e 3 +26 g) (x^+x^+l) + 


Collecting equal power terms yields the desired t/s. We make now the obvious 
choice K=6 and present the results in the tableau format in Fig. 3 a . We note 
here that the coefficients of t and t 3 have identical magnitudes. On the left 
of the bisecting line, the coefficients themselves are identical whereas to the 
right of it, they have opposite signs. This holds true also for the pairs 
(t^, t 4 ) , (t 2> t^) . Taking advantage of these symmetries, we can reduce the 
number of additions as shown in Fig. 3b. 

The final manipulations involve the elimination of the e^'s from Fig. 3b. 
This is based on their definitions (3.21), (3.24) and on a judicious application 
of (3.20), (3.23) as follows: 
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2 7 

s 6 -6 6 =e 4~ e 5 =a 5^ B 8 -B 5^ +2a 8 B 5~ Ct 4 B 8 = g 5^8^4^. + f 8^5"^. 

5 5 5 8 

g 2 -6 1 =2e 2 +e 3 -o 7 (8 3 -B 7 )-a 2 6 3 +2a 3 8 7 -S 3 (a 7 -a 2 )-8 7 (^i 2 -a 3 ) 

5 3 5 7 

'~ r r' 5 s 

g 3- 6 l“ e 2- 2e 3' a 7 8 2 +2a 2 6 3- <, 3 <B 3- S 2 ) -^VV- 6 3 (C, 7 H: ‘2 ) 

g 4 +8 6' e 5 +Ze 4" a 8 B 5 +2a 5 B 4' k, ‘4 (B 5 +S 4 ) ' B 4 (a 8 4<, 5 )+B 5 < “8' M ‘4 ) 


g 

This completes the derivation. 


(3.25) 

(3.26) 

(3.27) 

(3.28) 

(3.29) 

(3.30) 


The corresponding tableau derived in the first version of this paper contained 6 
extra additions. The subsequent appearance in print of the prescription of Winograd's 
algorithm [10], provided the clue to the specific manipulations ((3.25), etc.) 
utilized here to eliminate these extra additions. 
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bo 

1 





-1 




B 

B 

“ 

2 


B 

IS 

bi 



1 

-1 








B 

B 


B 

a 

b 2 


1 



-1 






1 



B 

B 

B 

b 3 

1 





1 




1 





B 

B 

b 4 



1 

1 






B 

B 

B 

B 

, 

B 

B 

b 5 


1 



1 






B 

fl 

B 

B 

B 

B 


c i 

C 2 

C 3 

V 

C 5 

C 6 




E 9 

IS 

EB 

ES 

ES 

ES 



1 

1 

1 




fix 

* a l = 

m 

B 

B 

B 

B 

n 





-1 

1 

.1 




*(-03-07) = 


B 


B 

B 

B 




0 

H 

■ 

■ 

■ 

■ 

Si 

*(o 7 -o 2 ) = 

fl 

fl 

B 

B 

B 





■ 

■ 

■ 


Q 

B 

SI 

*( a 8 + a 5 ) = 

m 

B 

B 


B 

B 







1 

1 


E 

*(<* 8 + a 4 ) = 

El 

B 

B 


1 


1 






-1 

1 

1 

^6 

* a 6 = • 

fl 

B 

B 


B 

B 

B 



1 


I] 





*(-“ 2 - a 3 ) = 

fl 

B 

B 










1 


1 

^8 

*( a 5 -a 4 ) c 

m 

_ 

_ 



B 

B 




n = 6 ( 8 M; 34 A ) 


Fig. 4. Algorithm for LCT of order 6 
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IV. The Basic DFT Algorithms for Prime N 

In this section, we apply the tableaus just derived to obtain the DFT 
algorithms for the odd prime terms of list (1.3), namely, 3, 5, 7. As indicated in 
Section I, these will serve as building blocks for higher order DFT's. 

We have seen in section II that with the proper relabeling, an N-th 
order DFT matrix displays an n-th order LC submatrix (2.7). The main part of the 
contribution of this submatrix to the overall transformation is spelled out in 
(2.15) repeated here 


n-1 



(g P ^) 

W vg 'b 


(p = 0,1, . . . ,n-l) 


(4.1) 


On the other hand, the LC tableaus of the last section are based on the (2.23), 
(2.33) formuation of the n-th order LC transformation. Therefore, in applying 
the LCT tableaus to the LC transformation expressed in (4.1), we must adopt the 
following identifications 

B p = Vli) (p = 0,1 n-1) (4.2) 

a =fiW (gI1 ~ 1 ’ P) (P = 0,1,..., n-1) (4.3) 

P 

Note the effect of (4.3). The LCT tableaus, being general, provide only a pre- 
scription for the computation of the ol's from the a^'s. However, since (4.3) 
provides an explicit formula for the a^'s, the a^'s may actually be computed. 
Specifically, the a^'s are expressible as 


a . = fie . 
i i 


(4.4) 


in which the e^'s are functions of ,i,N only and can thus be precomputed once 
and for all. 

We copy now the remaining equations of the relabeled DFT ((2.13), (2.19), 


( 2 . 20 )) 



B +B ' 

P P 


(p = 0,1, . . . ,n-l) 


(4.5) 


B p ^ b (0) 


B (0) fi(b (0) + 


(p 0)1) « « • ) n-1) 


gv 


(4.6) 

(4.7) 


Note that eqns. (4.6), (4.7) are valid only for the special case of prime N and 
are based on 
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n = N-l 


(4.8) 


Eqns. (4.1)-(4.5), on the other hand, are quite general and will also be applied 
in the next two sections where N is not prime. 

Our first step in the construction of the DFT tableau for prime N is 

the computation of the £^(N) constants. This is done by evaluating the a p ' s 

from (4.3) and then using them in the LCT tableau of order N-l to compute the 

a. 's, and hence the £ . ’s (4.4). 
i * 

The next step is to copy the LCT tableau of order N-l into the DFT tab- 
leau of order N. This is equivalent to the implementation of (4.1) and yields a 
tableau transforming the b^'s into the t j ' s . Now, using (2.9)-(2.11) , (4.2) 
we replace these variables by f^'s an ^ Fj ' s. The input and output index sequences 
we get at this point will usually be non-monotonic. Therefore, we now permute the 
rows/columns of the input and output squares to achieve index monotonicity. 

It should be pointed out that the variable changes (2.9)-(2.11) were 
introduced to expose the LC structure and thus allow the application of (2.39). 

Once this has been done, however, we want the resulting DFT tableau to be expressed 
in terms of the original variables F u >^ v with standard monotonic index sequences 
since this simplifies the integration of the tableau into the algorithm for 
larger N. 


We turn now to the implementation of (4.7) 
leaus satisfy 

n-l 

0 1 = £ bi ’ 

i=0 


Since all three LCT tab- 


(4.9) 


eqn. (4.7) is equivalent to 


F 0 = B (0) = Q(b (0) +6 1 ) = n < f (W (4.10) 

Finally, in order to efficiently realize (4.5), we examine the three LCT tableaus to 
see whether they contain a term which appears as a component with coefficient +1 of 
all t^'s. This term turns out to be 6^ . 

Hence, replacing 6 = cx^ 3-^ by 

5 1 = ^(O) 41 *! 13 ! (4.11) 

to B . Note, however, that the term fib... is 
P (0) 


will convert the output from B^' 
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already Included in computed in (4.10). This raises the possibility of 

eliminating one of the two multiplications evident in (4.11). Indeed, combining 
(4.10) with (4.11) yields (using (4.4)) 

5 1 = B (0) +fe 1 - 1 )^ 1 = F 0 +(e 1 -l)ne i (4.12) 

The multiplier of 8-^ is seen here to be the product of 9 , and a function of N. 

This turns out to be the general pattern for all 0 in all the DFT tableaus yet 
to be derived. Hence, we introduce now the notation 

q = (a).fi)e. (4.13) 

where ol is a function of i,N only. 8^ is always initially transformed as 
in (4.13). Hence one of our tasks in constructing the DFT tableaus is the deter- 
mination of the numerical values of the (a. constants. As we shall see in the 

l 

course of the developments, the underlying LCT tableaus in conjunction with (4.4), 
(4.13), always uniquely determine the ol's. In the case of (4.12) 

tq = e 1 ~1; q = B^+(u) 1 ^)8 1 = Fq+Cw^) (4.14) 

So far, we have considered the LCT-DFT transition in general terms. We 
turn now to specific cases: 

DFT of order 3 (Fig. 5) 2 tt 

Eqn. (4.10) is explicitly stated in the tableau. From (4.3) (with W = e 
g = 2) we get 


a o 

2 

_n 

r n 
w 2 

_ n 

w 1 


' 2 


2 


a l 

_ 2 _ 


w 1 


w 1 


(4.15) 


where W is the complex conjugate of W. Hence, applying the second order LCT 
tableau (Fig. 1) , we find 



Q 

2 ; 

1 

e l = - 2 ; 

u i" T 

i 

E 2= 

; w 2 = e 2 


-1 = - 
./I 

“ 1_ Z “ 


3 

2 
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0o 


/ * — , 


r (f 0 + / 3 r ) ft 

t 



0 

h 

^2 

1 


i 

1 

0i 

i 

-1 



*(-3/2)ft+F Q = 
*(iV^/2)il = 


a 

F i 

F 2 

Si 

1 

1 


-1 

1 


N =3 (3M ,6A) 


Fig. 5. Algorithm for DFT of order 3 


(f 0 + 0i) = F 0 


E 



fl 

D 

B 

■ 

1 
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■ 
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KB 

■ 

B 

D 


KB 

D 

_ 

_ 

-1 
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m 

s 

c 4 " 

B 

B 

■ 


51 

B 

H 



0 

B 

B 

roi 


0 





a 



D 

B 

e 


■Xcjg ft — 
*CJ 3 il - 
-*<44 ft = 

*{Ugil = 



l 



1 

F l 



1 

1 


f 2 



1 

-1 


F 3 


l 



-1 

F 4 



«2 

e 3 

e 4 


K 

l 

1 




0 

B 

B 




m 

fl 


1 



a 




1 


H 

B 


B 

D 



F 0 


<4 =- 5 . 

1 4 

o j 2 = (cos 36°+ cos 72°) 

w 3 = i (sin 36°+ sin 72°) 

w 4 = i (sin 36° - sin 72°) 

= - i sin 36° 


N =5 (6M ; I7A) 


Fig. 6 Algorithm for DFT of order 5 
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“i * e r 1 ' - f 

o>2 = £ 2 = “ Y (cos 36°+cos 72°) 

0)^ = 2e^ = i (sin 36°+sin 72°) 

0)^ = 2(e ^-e ,) = i(sin 36°-sin 72°) 

0)^ = -2 £4 = -i sin 36° 

DFT of order 7 (Fig. 8) 


■\ 


> (4.18) 


,2tt 
1 7 

W = e ; g = 3. Hence from (4.3) 


_a o 


[w 5 


w 2 

a l 


w 4 


w 3 

a 2 

n 

w 6 

ft 

w 1 

a 3 

6 

w 2 

6 

w 2 

a 4 


w 3 


w 3 

a 5 


w 1 


w 1 


(4.19) 


All a^'s are expressible in terms of the angle 

0 _ 90° (4.20) 

and its multiples. The prescription of Fig. 4 for the computation of the ot^'s 
is shown in Fig. 7 and the final tableau based on that is shown in Fig. 8. 




' v nv 1= -i 

; o; 2 = -^(a 3 +a 7 ) = -^(2sin0-cos 20 + cos 45) 

’ <jJ 3 = h^ a 7~ a 2^ ~ 3^' s ' n ® + 2 C0S 2S + co s45) 

’ aj 4 = a g) =_ 3 (cos 5 + 2 sin 25 + sin 45) 

; cu 5 = ^(a 8 + a 4 ) =-^(2cos5 + sin 2S-sin45) 

’ h a 6 =--^(cos5-sin 26 + sin 40 ) 

; cu 7 = -i(a 2 +a 3 )= -|(sin 5 + cos 25+ 2cos 45) 
; w 0 = i(a -a) =-4(-cos0 + sin 25 + 2sin45) 

o 5 4 ^ 


Fig. 7. Computation ofujj's for the DFT algorithm of order 7 
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-1 

F i 



-1 


F 2 

rH 

-1 



f 3 

1 

1 



F 4 



1 


F 5 




1 

F 6 

g 3 

«4 

g 5 

g 6 


1 





-1 





-1 






1 

1 




1 


1 



-1 

1 

1 









■ 

-1 

1 



N = 7 ( 9M;36A) 

; w 5 =--j(2cos0 + sin20-sin40) 
(2sin0 -cos 20 + cos 40) ; w 6 s '"k(cosfl-sin2^+ sin40) 
(-sin 9 + 2 cos 20 + cos 40) ; “> 7 = y(sin0 + cos 20+2cos40) 
(cos 0 +2sin20 + sin 40) •, w 8 *-^(-cos 0 + sin 20+2sin 40) 


Fig. 8. Algorithm for D FT of order 7 
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V. The Basic DFT Algorithms for N=4,9 

From (1.2) we see that with the tableaus of the last section, the 
highest realizable N would be 105(=3>5-7). With this in mind, we add now 
the trivial algorithm for N=2 (Fig. 9) thus increasing the maximal N to 
210 . 

If we want still higher N, we may either generate new tableaus for sue- 
cessively higher primes (11, 13, 17,...), or devise new tableaus for N=p 
(p prime). We adopt the latter course here. 

DFT of order 4 (Fig. 10 ) 


N=2^, yielding (2.7), 

The primitive root of 


n = 2 
4 is 


3. 


Hence, 


(5.1) 


p 

0 1 

r = 3 ^ mod 4 

1 3 


The number of rows excluded from the LC pattern. is 2. This number increases still 
further in the subsequent tableaus, reaching 8 for N = 16. This calls for some 
new and some modified terminology to facilitate handling the non-LC part of 
the matrix. 

First we extend the definition of r so that (2.10) will now include 

( 2 . 11 ) 


P 

0 1 (0) (2) 

r 

13 0 2 


Similarly, for the column indices a, s, we now have 


0 

0 1 (0) (2) 

s 

13 0 2 


(5.3) 


(5.4) 


In subsequent derivations, here and in the next section, we shall assume without 
explicitly so stating that the definitions of r,s have been extended, as 
indicated here, to cover all indices. Next, we complement (2. 13)— (2.15) with 






N =2 (2M; 2 A) 


Fig. 9. Algorithm for DFT of order 2 
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F' = B' ; F = B ; F = F +F' 
r p r p r r r 


(5.5) 


ITS 

Finally, recalling that the (r,s) element of the DFT matrix (2.4) is QW , 
we introduce the "exponent matrix" E, 


part . 


E 

r 

,s 

(r s) mod N 



(5.6) 

we find very convenient 

in accounting for 

the 

contribution of 

the non-LC 

In the present case 







s -> 

0 

2 

1 

3 



a -> 

• (0) 

(2) 

0 

1 



1 

0 

0 

0 

0 “ 

(0) 0 



0 

0 

2 

2 

(2) 2 

(5.7) 

E = 

0 

2 

1 

3 

0 1 


0 

2 

3 

1 

1 3 



1 

P 


1 

r 


Note that we have added here both kinds of row and column indices. One simple way 
of obtaining E is directly from its definition (5.6). In writing it down, 
we make sure that the non-bracketed 0 ,a indices would follow the sequence 
0,1,2,... . This will bring out the LC structure. The sequence of the other 
indices is immaterial. 

We observe that (5.7) displays a second order LC submatrix as was to be 
expected. We handle the effect of this submatrix with the LCT tableau of Fig. 1, 
starting with the evaluation of the a^'s 



(5.8) 


Hence (From Fig. 1) 


a l = 0; a 2 = iQ (5.9) 

We conclude that only the B 2 row contributes to F| and we copy it into the $ 3 
row of Fig. 10. 
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B 
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B 
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E! 
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D 
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B 
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M 

■ 

D 

B 

B 
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= 

*{2 = 


E 

m 

0 

m 

El 

0 

B 

B 

m 

Bi 

m 

B 

B 

B 

B 


B 

B 

B 

« 

m 

B 

B 

B 

D 


S3 

E9 

EH 

EH 


N = 4 (4M;8A) 


Fig. 10. Algorithm for DFT of order 4 
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Note that the output vector F appears scrambled in Fig. 10. In section IX 
we point out the advantage of certain tableau structures which allow storage economies 
in implementing the algorithm. The tableaus generated up to this point have both this 
desirable structure and an unscrambled output F. From here on, it seems impossible 
to realize both these desirable features simultaneously and we opt for the more impor- 
tant memory conserving structure. Hence, the scrambled output. The scrambling here 
is quite simple but becomes complex for N=16. To facilitate prescribing and hand- 
ling of this, we have added the vector r) to the affected tableaus. The scrambled 
F is identical with the unscrambled o* (see Fig. 10) 

We return now to the realization of the remainder of the E matrix 


(5.7) 


V F i = fl( 

e 2 


F 

r 


F'+6 

r 


2 



> (r = 1,3) 


J 


(5.10) 


2 

We have used here the fact that for N=4, W = -1. Equations like (5.10) can be read 
off directly from the E matrix. Such equations will henceforth be presented with- 
out any comment. 


f 2 = n{(f 0 +f 2 )-(f 1 +f 3 )} = Q6 0 - na 1 = <s 0 -6 1 
So Si 5 0 6 1 

f q = n{(f Q +f 2 )+(f 1 +f 3 )} = ne 0 + ^e 1 = 6 Q +6 1 
B 0 S x ^0 


This completes the derivation. 

DFT of Order 9 (Fig. 13) 

N = 3 2 . Hence (2.7), 


n = 6 

The primitive root of 9 is the primitive root of 3, namely, 2. This leads to 


(5.11) 


p 

0 

1 

2 

3 

4 

5 

r = 2? mod 9 

1 

2 

4 

8 

7 

5 


(5.12) 


Hence, the following E matrix 
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s 

a 


E 


->■036124875 
-> (0) (3) (6) 0 1 2 3 4 5 



( 0 ) 

(3) 

( 6 ) 

0 

1 

2 

3 

4 

5 

-t- 

P 


0 

3 
6 
1 
2 

4 
8 
7 

5 

r 


(5.13) 


With n = 6, Fig. 4 will be applicable here. We consider the c^'s first. The E 
matrix (5.13) clearly displays the aj^ sequence (this is also in agreement with 
(4.3)). Hence, 


V 


W 5 " 


V 

a l 


W 7 


w 2 

a 2 

ft 

w 8 

ft 

w 1 

a 3 

6 

w 4 

6 

w* 

a 4 


w 2 


w 2 

_ a 5_ 


_w 1 _ 


y_ 


(5.14) 


The computation of the a^'s 
nology of Figs. 4, 11 


and a) . ' s 

i 


is shown in Fig. 11. 



Note that, in the termi- 

(5.15) 


This means that in the realization of F! which is effected by copying Figs. 4, 11 into 

1 9 

Fig. 13, the terms associated with a^,ag, should be eliminated. In the actual 


Fig. 13 seems to indicate that these terms have not been eliminated. This is mislead- 
ing as the terms supporting this impression are those added later on. Winograd [jT]| 
appears to be unaware of (5.15). This leads to 2 extra multiplications in his algor- 
ithm for N = 9. 



£L\jj4 
6 ” 

1 





-1 



^i =- 3 cos 20° 

6 " 



1 

-1 





c 2 = § cos 40° 

1 

6 W 


1 



-1 



f 

yfsinlO" 

fw 4 

1 





1 


c.=-i§cos 10° 
4 3 

^-W 2 



1 

1 




/ 

c_=-i§sin 40° 
5 0 

6 


1 



1 


/ 


c 6 =-i^sin20° 

V. 


C 1 

C 2 

C 3 

C 4 

C 5 

C 6 

-t 



1 

1 

1 




a l 

ii 

(sin 10° -cos 20° + cos 40°) = 0 



-1 

1 




a 2 

~ 3 

(sin 10° -cos 40°) ; w g =- ^(a 3 +a 7 )=|;(sin 10° + 2cos 20°+ cos40°) 


1 

-1 





a 3 

_ a 

"‘3 

(cos 20°+ cos 40°) ; w 3 =i(a 7 -a 2 ) = -|(2sinl0 0 +cos20°-cos 40°) 






-1 

1 

a 4 

= -i 

§(sin 20°-sin40°h cu = i(a Q + aj =~4(2cos 10°+sin 20°+sin 40°) 

o AZ, O v> 5 





1 

1 


a 5 

=-i 

^{cos 10° +sin 40°) ; a> 5 =^(a 8 +a 4 ) =--^(coslO°+2sin20°-sin 40°) 





-1 

1 

1 

a 6 

= i§(coslO°-sin 20°-sin40°) = 0 


1 


-1 




a 7 

=“ (sin 10° + cos 20°) ; a> 7 =-^(a 2 +a 3 )=-|(-sinl0 0 +cos20°+2cos40 0 ) 





1 

. 


1 

a 8 

=-i^(cos 10°+sin 20°) ; cu 8 =i(a 5 - a^) = --i(coslO°-sin 20°+2sin 40°) 


Fig. 11. Computation of clIj's for the DFT algorithm of order 9 
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copying from these figures, we also apply some permutations in addition to the obvious 
ones involving the input and output variables. These permutations are spelled out in 

(Fig. 12). Note that the permutation s i ^ j is shown in two sequential steps: 

8i e j 5 e j Sj • (Example : gj+e^g ) 9 a 



Fig. 13 
index (j) 


Fig. 12. Index permutations in assembling Fig. 13. 


We turn now to account for the rest of the matrix using (5.13) as our guide. In doing 
that we encounter the following constants 


QW 3 = Q(-sin30°-i cos30°) = -fi(-j+ i -—■) 

nw 6 = fiW 3 = -S(j- i ~) 

which we use in the following. 

r = 1; 4; 7 

F r -F r =fi{f 0" ( 2 + 1 T ) f 3 " ^2 ~ 1_ 2^ f 6^ 


= n{ fg - |(f 6 +f 3 ) + if ( VV = 

^0 ^3 ^6 X 5 3 

= V S 3 - 

e 3 e 6 


In the case of , we 


introduce two canceling sign changes. 


9a 
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*• F r = F r + e 3 -e 6 

We withhold implementation and proceed to the next group 

r » 2; 8; 5 

^ ^ 

Compared to the previous case we have here interchange of W with W (=W ). Hence, 


F = F' + e,+e, 
r r Jo 


At this point we implement both groups as shown in Fig. 12. 

Next we implement F^jFg. From (5.13), 

F 3 - n { <£ 0 +f 3 +£ 6>-<i + £ f> £ f > < £ 2 +£ 8 +£ 5 ) } 

- a {fo + jW -?[ ( , t 8 + !l > ' £ »7* f 2 ) ' £(f 5 +£ _4 ) 1 + 

% 


3 


0 ^3 C L C 2 

,/3 

2 K1 " 7 Ll> 

^8 '^7 c 5 

/J 


+ i-^ [(f 8 -£ 1 >-(f 7 -£ 2 )+(f 5 -f 4 >]}. 




- n3 0 + ^33 “ 2^ c i +c 2 +c 4) + i_ 2~ ^( c 5 -c 7 +c 3) = (6Q+26 3 )-(^8j.)-(-i 2fl3g) 


% 2 ^ 


h 


e ° ^ e ° 6l ^8 8 i 8 8 
e l e 8 g l s 8 

3 

We turn now to F-. The only difference here is the interchange of W with 

£ O O 

W (=W ) . Hence 

F £ = g +g 

6 6 i & 8 

Finally, 

+ 2 fi = e o +2ei = g o 

e l 

This completes the derivation. 

With the two additional tableaus developed in this section, the maximal 
realizable N has been pushed to 1260 (4 - 5 - 7 - 9) . We push it still higher (5040) 
with the tableaus of the next section. 


F 0 = n(f 0 +f 3 +f 6 )+Q8 1 =S13 0 +^6 3 +Q8 1 = (6 0 + 26 3 ) 


0 26, 26 i 
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m 

■ 

D 

■ 

9 

■ 

■ 

■ 

■ 
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« 8 
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1 


0 

m 

0 

s 

0 

0 

0 

0 

0 




0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 




^2 

^3 

% 


\ 

^ 7 

^8 










0o 

*X2 = 

*d/2)a = 

*oj 2 f2 = 

*{l/2)£L = 

* oj 4 X2 = 

*(-iv/3/2m= 

* (jjjQ. = 

*(-i^/2)X2= 

-K-CU 12 = 

10 

So 
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l 
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0i 

Si 
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l 



-1 






8 2 
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© 






03 

s 3 

2 



-1 
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-1 





04 

s 4 



l 







l 







1 
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05 

s 5 








l 



l 







© 



h 

s 6 







© 










-1 
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s 7 






1 


l 









1 


-1 

1 

0„ 

Sg 













l 

-1 







09 

S 9 





-1 





l 









1 

1 


8 io 






-1 





l 


o» 2 = -i(2sin 10°+ cos 20°-cos 40°) j w 7 =~l(2cos 10°+ sin 20°+ sin 40°) 
cu( 4 = | (sin 10° + 2 cos 20°+ cos. 40°) . <*» 9 = ^{-sin 10°+ cos 20° + 2 cos 40°) 

0J 5 = "i( cos 10° + 2sin 20° - sin 40°) ; w 10 =--3;( cos 10°-sin 20° + 2sin40°) 

Fig. 13. Algorithm for DFT of order 9 


N = 9 (11M-, 44A) 
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VI . The Basic DFT Algorithms for N=8,16 


In developing the tableaus for N = 8, 16, we face a complication due 
to the fact that 

N = 2 k (k > 2) (6.1) 

has no primitive roots. In other words, unlike the N=4 case, there is no integer 
whose powers (mod N) would generate all of the odd numbers in the interval (1,N) . 

We can, however, generate half of these with powers (mod N) of the number 3 
We proceed now to modify the relabeling scheme to handle this case. First we 
modify (2.9) to read as follows: 


r = 3^ mod N) „ 

> (p ,a = 0,1,... j- 1) (6.2) 

s = 3 mod N) 


r = -3 P mod N) - 

) (p,a = 0,1,... f-1) (6.3) 

s = -3 mod N) 


Note that we have introduced here (6.3) a new type of index so that we now have three 
kinds: (i),i,i. We illustrate this with the case N=8 


p 

0 

T 

0 

1 

r = 3^ mod 

8 

- 

- 

1 

3 

r = -3 ^ mod 

8 

7 

5 

- 

- 


(6.4) 


It should be stressed that the bar or parentheses have no effect on the 

numerical value of the index. Their only effect is on the choice of the functional 

relationships connecting the new entities b^jB to the original variables f ,F^_. 

2 _ P s r 

Consider for example the expression g B. . If we evaluate it for i=l, its value 
1 ] 1 

is g B = g F (see (6.4)). If, on the other hand, we want its value for i = T 
1 1 -> 

we get g B r = g F 5 - 

Eqns. (2.10), (2.11) now read 

F r = B p’ f s = b a (r ’ S 0dd) 

F r = B (r) ’ f s = b (s) (r ’ S even) 



(6.5) 
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so that the overall r,p functional relationship can be summarized as follows 


p 

0 10 1 

(0) 

(2) 

(4) 

(6) 

r 

7 5 13 

0 

2 

4 

6 


with an identical table relating s to a. 

Eqns. (6.2)-(6.5) are now 
analogous treatment in section II.) 


Eqns. (6.2)— (6.5) are now applied to eliminate F u ,f from (2.4). (See 


(2t) 


1-1 I 

(n-2) 0 , 4 a 4 a 

F = n E W 2mt b + (2 I] w 2 3 b + n E W 2t3 b 

2t m=(0K(2)... m a% ° 0=0 


(t = 0,1,... | -1) (6.7) 


A /\ /\ ^ ^ 

B = B + b' ; F = B ; F' = B' 
P P P * r p* r p 


(r odd) 


( 6 . 8 ) 


(N-2) 

n E 

m= (0) , (2) , . 


w ' ra3 b 


B = 


(N-2) p 

n E b 

m=(0) , (2) , . . . 


m 


m 


(p=0,l,...,|-l) 


(p=o,i,...,| -1) 


> (6.9) 


B = 
P 


1-1 »-i 

4 /oP"FO\ ^ _ /oP"F^7\ 

n E w < 3 b + q E w (3 u 

o=(J 0 0=0 


f- 1 


la £ «- (3P+ °>b„ + a E ‘p - 0 - 1 f- 13 


— _ 1 

4 /nP+O. 


0=0 


o=0 


(P=0,1 | -1) 


N 




y (6.10) 


It is obvious that all four matrix products appearing in (6.10) are Hankel 
transformations (p+o) of order We prove now that they are also LCT's. To do 
this, we must show that (see (2.2), (2.16)) 


, N , 

P"F 4 “1 n _i v 

3 = 3 P mod N (N = 2 ; k > 2) 


( 6 . 11 ) 
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or equivalently, 


N 

3 4 = 1 


mod N 


(N = 2 ; k > 2) 


( 6 . 12 ) 


n 

,4 


We prove (6.12) by induction on k. Assume it to be true for N=n, then 3 = mn+1 

ii (2n) 

(m integer). Squaring yields 3^ = (m2^-) (2n) +m(2n)+l so that 3 4 =1 mod (2n) 
and (6.12) is true for N = 2n. Finally, (6.12) is obviously true for k=3. 

Thus, (6.10) involves four LC matrix products. Furthermore, these four LC 
matrices comprise a second order compound matrix which is an LC itself. All 
these features stand out quite clearly in the two cases to be developed now. 

DFT of Order 8 (Fig. 14) 

Applying the permutation of (6.6), we get the following E matrix 


s ■* 
a -> 


E = 


0 

4 

2 

6 

7 

5 

1 

3 



(0) 

(4) 

(2) 

(6) 

0 

I 

0 

1 



0 

0 

0 

0 

0 

0 

0 

0 

(0) 

0 

0 

0 

0 

0 

4 

4 

4 

4 

(4) 

4 

0 

0 

4 

4 
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2 

2 

6 

(2) 

2 

0 

0 

4 

4 

2 

6 

6 

2 

(6) 

6 

0 

4 

6 

2 

1 

3 

7 

5 

0 

7 

0 

4 

2 

6 

3 

1 

5 

7 

I 

5 

0 

4 

2 

6 

7 

5 

1 

3 

0 

1 

0 

4 

6 

2 

5 

7 

3 

1 

1 

3 


+ 

r 


(6.13) 


We turn now to an explicit representation of the submatrix corresponding to 
the lower right quarter of E. Denoting 


we get 


w, = fiW 
k 


(6.14) 




— 



•- 


W 1 

W 3 

W 7 

W 5 



w 3 

W 1 

W 5 

w 7 

F i 


W 7 

W 5 

W 1 

W 3 

F 3 


w 5 

W 7 

W 3 

W 1 


(6.15) 
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We see here four second-order LC submatrices, two each, of types a^, a^ . Using 
the terminology introduced here, we write down the equivalent second-order compound 
matrix 



(6.16) 


Note that (6.16) is also an LCT. We propose now to evaluate (6.16) through a direct 
application of the tableau of Fig. 1. Some elaboration is in order here. The 
tableaus in this paper have been derived with the implied assumption that the 
variables appearing in them are all scalars. This is not really necessary. Review- 
ing the derivations, one concludes that the tableaus are also valid under the 
following generalized interpretation: 

1. The row (column) elements (f . ,c . , 3. , <5. ,e. ,F. , . . . etc.) are column submatrices 

i i i i i x 

2. a i ,a i’ ^ are s 9 uare matrices 

3. a), is still a scalar constant 

l 

4. The tableau entries 

B *a = 6 ; B ft =6 

x x A 10 
should be interpreted as the following matrix products 


a. B. 


6 . ; w.ftB. = 6 . 

i i i i 


We introduce this generalization here with the immediate goal of evaluating (6.16). 
However, its importance transcends this immediate application. It is this generaliza- 
tion which elevates the tableaus from the rather theoretical realm of fast algorithms 
for very low order DFT's into the very practical realm of high-order, high-speed 
DFT algorithms. The specific way in which this is done is presented in detail in 
section VII . 

We return now to the evaluation of (6.16). Applying Fig. 1 we get 


6 o *1 




C' A 

t i t o 




A 


1 1 

h 

* “l = 

6 1 

1 1 

1 -1 

A 

^2 

* a 
a 2 = 

6 2 

-1 1 


(6.17) 


This "modified interpretation" can be avoided by using row matrices instead of 
column matrices. Note in this context that the ft matrix we will be concerned with 
is symmetric. 
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We turn now to the realization of the remaining parts of (6.13) 

r = 1;5 

F r = F ; + ^{< £ Q- f 4 ) + i( VV> * F r + H(Vic^) - f ; +^34 " F r + h 
c 4 c 6 g 4 

' r = 3 ; 7 

F r = F ; + n( c 4 -ic 6 ) = F ;+g 6 

^6 
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F & = fi{(f 0 +f 4 ) - (f 2 +f 6 ) - i(f 7 -f 1 ) + i(f 5 -f 3 )} = Q{ (c Q -c 2 ) + i(c 5 -c 7 ) } 

'V ""^5 “ 

F 6 = fi ( e 2 +ie 5^ = ^5 = s 5 
F 2 = n(e 2 -ie 5 ) = ftB 2 = g 2 

F 4 = fl{(f 0 +f 4 ) + (f 6 +f 2 ) - (fy+fj) - (f 5 +f 3 )} 

- — . — I - - — I — - — 

C 0 C 2 C 1 c 3 

= ^(e Q -e 3 ) = Qg 3 = g 3 
&3 

F 0 = «(e 0 +e 3 ) = nB 0 = g Q 

e o 

This concludes the derivation., 

DFT of Order 16 (Fig. 17 ) 

The permutation is controlled by the following table 


p 

0 

T 

2 

3 

0 

1 

2 

3 

r =3^ mod 

16 

- 

- 

- 

- 

1 

3 

9 

11 

r = -3 ^ mod 

16 

15 

13 

7 

5 

- 

- 

- 

- 


Applying this we get the following E matrix 


= n{ ( cq+c 2 )-(c 1 +c 3 ) } 



(6.19) 
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4 

8 

12 

2 

6 

10 

(4) 

(8) 

(12) 

(2) 

(6) 

do: 

0 

0 

0 

0 

0 

0 

0 

0 

0 

8 

8 

8 

0 

0 

0 

0 

0 

0 

0 

0 

0 

8 

8 

8 

8 

0 

8 

4 

12 

4 

8 

0 

8 

12 

4 

12 

8 

0 

8 

4 

12 

4 

8 

0 

8 

12 

4 

12 

12 

8 

4 

14 

10 

6 

4 

8 

12 

10 

14 

2 

12 

8 

4 

14 

10 

6 

4 

8 

12 

10 

14 

2 


4 

8 

12 2 

6 

10 

14 

12 

8 

4 6 

2 

14 

10 

4 

8 

12 2 

6 

10 

14 

12 

8 

4 6 

2 

14 

10 


12 


13 

7 

5 

I 

2 

3 

0 

0 

0 

4 

12 

4 

8 

. 8 

8 

12 

4 

12 

10 

14 

10 

14 

10 

14 

2 

6 

2 

6 

2 

6 


2 

6 


9 11 15 


0 0 0 

12 4 12 

8 8 8 

4 12 4 

__ _ _ 

2 6 2 


9 

11 

1 

3 

7 

5 

15 

11 

1 

3 

9 

i 5 

15 

13 

15 

13 

7 

5 

1 

3 

9 

13 

7 

5 

15 

3 

9 

11 

7 

5 

15 

13 

9 

11 

1 

5 

15 

13 

7 

11 

1 

3 


3 7 
9 TT 


( 0 ) 0 
(4) 4 

( 8 ) 8 
( 12 ) 12 
( 2 ) 2 
( 6 ) 6 


10 

14 

10 

14 

(10) 

10 

14 

10 

14 

10 

(14) 

14 

15 

13 

7 

5 

0 

15 

13 

7 

5 

15 

T 

13 


3 11 

+ t 


In view of the complexity of the present case, we derive the tableau in two 
stages. F^, the contribution of the LC submatrices , is considered separately in the 
intermediate tableau of Fig. 15 which is then copied into the final tableau of Fig. 17. 
Following the previous case, we write down explicitly the LC part 




m 

IS 

m 

IS 

m 

IS 

m 

m 


■ 

m 

■ 

■ 

■ 

u 

■! 

■ 

m 

■ 

■ 

m 

■ 

m 

■ 

■ 

■ 

■ 

m 

m 

m 

u 

■ 

■ 

■ 

■ 

m 

■ 

m 

[0] 

m 
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r 

*15 

A 


*13 

rt 


*7 



n 


r 

F i 

A 

c o< 


F' 

3 

p' 

f 9 


< 

F' 

11 



w 







L 






\ 

1 — 1 

ts 

W 3 

W 9 

W 11 

W 15 

W 13 

W 7 

W 5 

W 3 

W 9 

W 11 

W 1 

W 13 

W 7 

W 5 

W 15 

W 9 

W 11 

W 1 

W 3 

W 7 

W 5 

W 15 

w 13 

W 11 

W 1 

W 3 

W 9 

w 5 

W 15 

W 13 

W 7 

W 15 

W 13 

W 7 

W 5 

W 1 

W 3 

W 9 

W 11 

W 13 

W 7 

W 5 

W 15 

W 3 

W 9 

W 11 

W 1 

W 7 

W 5 

W 15 

W 13 

W 9 

W 11 

W 1 

W 3 

W 5 

W 15 

W 13 

W 7 

W 11 

W 1 

W 3 

W 9 







that is, 


e i 


/s 

a l 

A 

a o 


— A 

b o 

A 

fc o 


/s 

a o 

/V 

a l 


/\ 

b l 


( 6 . 21 ) 


( 6 . 22 ) 


We handle (6.22) via the tableaus (6.17), (6.18). It is obvious from (6.18) that 
a 1 ,a 2 will be LC matrices. Hence, it is sufficient to compute their first rows 
only. We express these in terms of the basic angle 


[First row of ^ [cos 0 

[First row of [] = ifl [sin 0 



sin 0 
cos 0 


= 22.5° 

-cos 0 
-sin 0 


-sin 0] 
-cos 0] 


Turning to (6.17), we get 


_f i5 +f r 


C 1 


f 15- f l" 


c 15 

f 13 +f 3 


C 3 

; § 2 = 

f 13~ f 3 


C 13 

f 7 +f 9 


C 7 

f 7 - f 9 


_C 9 

_ f 5 +f n _ 


c 5 


_ f s - hi 


~ c ll 


(6.23) 

(6.24) 

(6.25) 


(6.26) 
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The next step in (6.17) calls for the evaluation of two LCT's of order 4, namely, 


6. = ct. 6. (i = 1,2) 

i ii 


We adopt now the following notation for the components of 


A 



(6.27) 


(6.28) 


These will be determined now by applying the tableau of Fig. 2. 

■A A A 

Implementation of = a^ B-^ 


n 

4 

sin 

9 

1 



1 

• 


ft 

“ 4 

cos 

0 


1 

1 


■ 


ft 

4 

sin 

0 

1 



-1 



ft 

4 

cos 

0 


1 

-1 







0 

0 

ft - 
- -^cos0 

ft . Q 
-2 sln 9 






1 

1 



0 





1 

-1 



0 







© 


ft ' 

- -^cos 0 

= e^ft ;(j0^=2e^g=- (cos0+sin0) 







© 

ft . Q 
- y sm0 

=£yft;tOy=2(e^-e^) = (cos0-sin0) 






1 

1 

- Y( cos 0 +s i- n 0) 

= e 1£ ft: (jj,,=-2e-,= sin0 
16 16 7 


Note that the vanishing of the first two multipliers means that only a portion of 
Fig. 2 is required, namely, the portion involving the rows of B^jB^.B^- This is 
copied into rows of B^.BysB^* respectively, in Fig. 15 (Hence the adopted 
0 ) indices). Note further that Fig. 15 shows two alternative paths leading from 
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g. to F'. The upper path should be ignored for now. The preceding discussion 
r r ^ ^ 

has derived that portion of the tableau leading to o^_. . We turn now to 6 2 j • 

* £ ' 

Implementation of 0 0 = a 2°2 



The situation here is very similar to the case. The rows of of 

Fig. 2 are now copied into the rows of respectively, of Fig. 15. 

A A ^ ^ 

With ^^>^2 now ava H- a ble> we apply (6.17) to obtain t^,tQ as shown in 
the lower part of Fig. 15. The transition from g_^ to F^ is shown there requiring 
8 additions. The upper part realizes the same transformation with only 4 additions 
and is the version copied into Fig. 17. The identity of the two paths can be easily 
verified by inspection. For example, the upper part prescribes F^ = g^+ 6^5 but so 
does the lower part. Verifying such agreements for all 8 outputs, establishes the identity. 

We implement now the remaining parts of (6.20). To bring out clearly the 
various symmetries we are exploiting here, we show an explicit form of the remaining 
part of the permuted DFT matrix in Fig. 16. It is shown here for convenience as the 
sum of two matrices and expressed in terms of the constant 

1 


(6.31) 




( 0 ) 0 

(4) 4 

( 8 ) 8 

( 12 ) 12 

( 2 ) 2 

( 6 ) 6 

( 10 ) 10 

(14) 14 

6 15 

1 13 

2 7 

3 5 

0 1 

1 3 

2 9 

3 11 

t t 

P r 


<J1 

CO 
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We define now 


c 14 f 6 f 14 ; C 12 f 4~ f 12 ; C 10 f 2 _f 10 ; C 8 f 0 f 8 

and proceed with the evaluation as follows : 

r = 1;9 

F r _F r = ^ ^ c 8 _lc 12^ " Y ^ c 14 -C 10^ + 1 ^ c 14 +c 10^ = ^^12 ie 10 ) 


8 


12 


'10 


"14 


12 


10 


^12 iY ^10 8 12” ^10 g 12 SfO h 12 

g 12 6 10 S 10 


F = F'+h 10 
r r 12 


r = 5 ;13 


F r - ?; + (g 12 +s 10 > - F;+h 10 




h 


10 


r = 3; 11 


F r" F ; = fijCcg+ic^) +Y[ (c 14 -c 1Q ) - Kc 14 +c 10 ) 1^ ^B8-i^( e 14 +ie 1Q ) 


'10 


'14 


14 


■ .V **»14 ■ s 8 ‘ 6 14 ■ S 8' S 14‘ h 8 


8 8 6 14 


S 14 


F = F’+h„ 
r r 8 


r = 7 ; 15 


F r -r; + (g 8 +g 14 ) = F;+h u 


‘14 


We define now 


c 0 “ f 0 +f 8 ; C 2 f 2 +f 10 ; C 4 f 4 +f 12 ; c 6 f 6 +f 14 
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g 4 = 6 4 = ftB 4 = ^(c Q -c 4 ) 

and turn to the next group of rows (F. ,F. ,F- n ,F n . ) 

Z o 1U 14 

+ i(c 15 -c 9 ) + l( c 13 -c 11 )]] 
e 9 e ll 

= g 4 - ±^3 6 +r^(e 1 -e 3 )+l’Yfl(e g +e 11 ) = g 4 - Sg +yf2B 3 + ±Y^Bg= g 4 -g 6 +S 3 + 5 9 


F 2 = g 4 +f2{-i(c 2 -c 6 ) + y £(c 7 +c ; l ) - (c 5 +c 3 ) 

e i e_ 



= (g 4 -S 6 ) + (g 9 +g 3 ) “ h 4 +h 3 



g 


9 


Noting which columns in Fig. 16 are associated with each of the four g^'s comprising 
F 2 , we observe that F^ , F^q> F^ 4 involve the same g^'s but with different sign 
combinations. Specifically, 

F 10 = ^ 8 4 _S 6^ “ ^ S 9 +8 J > = h 4 _h 3 

F 6 = ( - g 4^ g 6 ' ) + ^ g 9 -g 3^ = h 6 +h 9 
h 6 h 9 


14 


= (g 4 +g 6 ) - (S 9 


-gj 


= h,-h 
6 9 


Finally we consider the upper four rows 


F 0 = ft {^ c o +c 4^ + ^ C 2 +C 6^ + ^ c 7 +c l^ + C C 5 +C 3 ) 1 

e 0 e 2 e l e 3 

= ftBg +JJP, = ^0 + ^1 = 8 0 +8 1 = h 0 + 

6 0 5 1 8 0 S 1 h 0 ^1 



• * F 8 h 0 h l 
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= ^{(e n -e 2 ) + iB c 1 5“ C 9)"(c 1 fC 11 Q} = ^3, + ift(e q -e i;L ) 


13 _11- 

'11 


'9 U 


= + 6 n = h o + gn = h o + h n 


’ll 


11 


,F 12 h 2 h ll 


This completes the derivation. Note that the size of this tableau 
notational peculiarities. In particular, 0 )(i) = B(i) = 3^, 


■ 5 2 + inB n 



imposes certain 
etc . 
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VII. 


The DFT Algorithm for 


N = 


K 



11a 


At this point, we finally are in possession of DFT tableaus for all 
orders listed in (1.3). Our immediate goal is their integration into a DFT 
algorithm of order 

K 

N = n (N ' s relativel y prime) (7.1) 

k=l 


In the subsequent derivations, we will also need the following partial products 
of these factors 


v. 

i 


< 


n 

k=i+l 


N 


k 


V K =1 


- n n = a 


k=l 

k^i 


N. 


(0 < i < k) 


(7.2) 

(7.3) 


Let us introduce now a set of matrices which figure prominently in 
the development of the algorithm. Consider a matrix whose order N satisfies 

(7.1) . We denote by £2^ its upper-left submatrix of order v^. Note that 

(7.2) implies the existence of a sequence of these submatrices 

, . . . 

0 1 K 

The matrix order decreases to the right. f2^ on the left is of order N and 
is thus identical with the overall matrix. 12^ on the right is of order 1 
and is thus the upper-left element of the overall matrix . 

The second item we introduce here is the graphical representation of 
the tableaus shown in Figure 18. We describe this in terms of the generalized, 
compound-matrix, interpretation of the tableaus (see discussion following 
(6.16)). Let the tableau variables f ^ , F^ , . . . etc. represent m- 
dimensional vectors. Then, the tableau of order N is applicable to any matrix 
transformation of order mN in which the transforming matrix, Y, when regarded 
as a compound matrix of order N, is the N-th order DFT matrix (2.4). In this 
11a 

The basic idea underlying the developments in this section is commonly 
attributed to I.J. Good |_ll] . Here, we find it more convenient to avoid 
Good's explicit use of the Kronecker matrix product. 



case, the parameter of the tableau is the m-th order upper-left submatrix 
of Y. 

Examination of the tableaus reveals that they all consist of three 
parts corresponding to three distinct phases of the algorithms they describe. 

In phase 1, the mN-dimensional input vector f is operated upon to yield the 
m-dimensional 3^ vectors. In phase 2, a scalar multiple of the fl submatrix 
transforms 3^ into £^ according to (4.13) (£_^ = i = 0,1, • . . ,M-1). 

M is the number of multiplications appearing in the tableau designation. Obvious- 
ly, this is also the number of (L vectors generated in phase 1. Finally, 
in phase 3, the m-dimensional vectors are operated upon to yield 
the mN-dimensional output vector F. The three phases are represented 
schematically in Fig. 18. The conventions adopted here are as follows: 

All lines represent vectors. We may attach an integer to a line to 
indicate the dimensionality of the vector it represents (see Fig. 21). 

Phase 1 is represented by a circle, phase 3 by a square. In either case, the 
symbol inside designates the matrix transforming the "circle input" to the 
"square output". Note that the only arithmetic operations involved inside 
either the circle or the square, are additions and subtractions. 


With these preliminaries out of the way, we are ready now to consider 
a set of scrambling guidelines that would allow a simple implementation of the 
DFT of order N satisfying (7.1). Denoting the scrambled matrix by Y, we pro- 
pose the following set of sufficient conditions: 

1. Y(=flg) should be a compound DFT matrix of order N-^. This will 
allow the implementation of phase 1 of the N^ tableau. Phase 2 
calls for the determination of (uhQpB^ Hence, 

2. Q 1 should be a compound DFT matrix of order N 2> We apply phase 

1 of the N 2 tableau and are led, in phase 2 to a transformation involv- 
ing ^2* Hence, 

j [/ 3 - should be a compound DFT matrix of order N 3 and so on down 

t0 ^K— 1 wbicb should be a DFT matrix of order N . 

1L 

For most l values, £ ± = (L of the tableaus but not for all of them. For 

example, the N = 5 tableau implies £ = F„, S = £ +£ 

^0 O’ 1 S 1 s 0‘ 
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PHASE 2 


£r w i 


UPPER-LEFT 
SUBMATRIX OF Y 





Fig. 18. Schematic representation of the basic DFT tableaus 
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All this can be summarized as follows: A convenient relabeling scheme 

would be one which satisfies the following set of K constraints: 


Submatrix 51 of Y should be 
k-1 

a compound DFT matrix of order 

N, . This should hold for all 
k 

1 < k< k 


(7.4) 


While an algorithm following the above outline would be quite convenient to 

implement, it is not at all clear that a scrambling scheme that would simultaneously 

satisfy all K constraints of (7.4) does, in fact, exist. 

We proceed now to develop a relabeling scheme which comes very close 

to (7.4), adding only a minor complication to the above algorithm outline. We 

12 

start with the standard DFT matrix ((2.4) with Q = 1) . 


Let • 



N-l 


v=0 



(u=0,l, . . . ,N-1) . 


(7.5) 



(7.6) 


where 

u = X(m) ; v = X(n) (m,n = 0,1,..., N-l) (7.7) 
and the function X is yet to be specified. This transforms (7.5) into 


Q w A(m)X(n) 

m / . n 

n=0 



Y q 
mn n 


(m=0,l, . . . ,N-1) 


(7.8) 


The function X will be defined in terms of a modular representation 
D1 of u,v. In other words, the relabeling depends on the entities 


u^ = u mod N^ 
v^ = v mod 


(k= 1,2,. . . ,k) 


(7.9) 


According to the Chinese Remainder Theorem |~ 7~j , these remainders uniquely deter- 
mine any 0<^(u,v)<N. Hence, we may adopt the following representation for u,v 


12 ~ 
we use here F ,f 

variables F„,f . L 

U 7 v 


to distinguish these entities from the tableau 
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U = ( u 1 » u 2>* • ‘ » u p 

v = ^ v i » v 2 * * * * ’ v kP 


(7.10) 


The function X is now defined in terms of the following combination 
of (7.7), (7.10) 


v = (v 1> v 2 , . . . .v^) = X (n) 


(7.11) 


X(n) should be such that as n follows the sequence 0,1,...,N-1, each 

should follow a periodic repetition of the sequence 0,1, . . . ,N^-1, starting 

with v^=0. should be stepped with every V^-increment of n (see (7.2)). 

This means that v^ varies very slowly, v 2 varies faster and so on, up to 

v which varies in step with n. 

K r 

To illustrate this scrambling, consider the following example 

K = 3; N = 8; N 2 = 3; N 3 - 5; N = 120 (7.12) 

for which part of the index sequence would look as follows: 


n 

V 

• 

3 

• 

• 

48= (0,0,3) 

4 

24= (0,0,4) 

5 

40= (0,1,0) 

6 

16= (0,1,1) 

62 

12= (4,0,2) 

63 

108= (4,0,3) 

64 

00 

■p* 

ll 

/— N 

o 

S-/ 

65 

100= (4,1,0) 

66 

76 = (4,1,1) 

• 

• 


One could use modular arithmetic subroutine's to determine this sequence on a com- 
puter. Alternatively, it could be determined in a scheme using neither computers 
nor computations. All that is called for is some tedious writing down of 
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repetitive sequences. We illustrate this method for our example (7.12) in Tables 
2,3. First we write down the sequence 0,1,..., N-l as shown in Table 2 under 
the heading v. Then we write down next to this column, under the heading v^, 
a periodic repetition of the sequence 0,l,...,N k ~l and repeat this for all 
l_<k<^K. This yields representation (7.10) for all v's. The particular 
arrangement in Table 2 saves some writing and is convenient for the next step 
which is just a reordering of Table 2 in the desired sequence. 

To simplify the process, we let Table 2 determine the order in which 
Table 3 is being filled in. For example, the first v values copied from 
Table 2 into Table 3 are 0,40,80,8,48,88,16,56,... etc. Table 3 then pre- 
scribes the index sequence of the scrambled input vector for (7.12), namely, 


A A A 

r 0’ f 96 ,f 72 


' ’ f 104’ f 105* * ,f 89’ f 90’ * * ’ ’ f 119 


To facilitate the analysis of the adopted scrambling, we introduce now 
E, the exponent matrix of Y. Recall that (7.8) implies 

Y = ^(nOMn) (7.13) 

mn 

Hence we define the exponent matrix E as follows: 

E = A(m)A(n) = uv (7.14) 

mn 

Similarly, paralleling the submatrices of Y we define ^ as the upper- 

left V. -order submatrix of E. 
k 

Let us examine now the structure of £ starting with the distri- 
bution of u^jV^. The scrambling prescribes that, starting with the value zero 
at the upper-left corner, u^ should be increased by one every 
rows. Similarly, v k should increase by one every columns. This, then 

defines a subdivision of g into submatrices of order v . g can now 

* c— 1 / ^k-Ji ^ k— 1 

be regarded as a compound matrix of order N. {=- — ^-J whose (r,s) "element" 

k k ' 

is characterized by 

U k = r ’ V k = S (7.15) 

The (0,0) element of this compound matrix is obviously identical with Let 

us pick now an element in an arbitrary position in l ts u,v will have the 

form 


u = 
v = 


(0,...,0,u k+1 ,u k+2 ,...,u |< ) 

(0 0 , V k +l,V2 V 


(7.16) 
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Table 2 

Modular Representation of v in Example (7.12) 
N x = 8; N 2 = 3; N 3 = 5 
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Table 3 

Scrambling for Example (7.12) 

n vs. v= (v^ > v 2 ,v 3 ^ 
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Now, the adopted scrambling scheme guarantees that all elements of <*? 


k-1 

occupying the same identical position in the other submatrices of £ , have 

u.,v. which differ from (7.16) only in u. ,v, and these satisfy (7.15). This 
ii k. K. 

means that the difference between an element in submatrix (r,s) of and 

the corresponding element in .' k (submatrix (0,0)) is just ( [V] > (7.14)) 


AE(r,s) = (0,...,0,rs mod N^,0,...,0) 

+ 

k-th position 


(7.17) 


We notice the important fact that the difference is independent of the specific 
common location of the two paired elements in their respective submatrices. 

Hence, the difference is constant throughout the (r,s) submatrix. In other 
words, the (r,s) submatrix of ^ could be generated from ^ simply by 

adding the constant AE(r,s) to all its elements. 

We are interested in an explicit expression for this important con- 
stant. Considering its implicit formulation (7.17), we conclude that it must 
be some integral multiple of n^ (7.3). Specifically, 

AE(r,s) = n(r,s)n k (7.18) 

where the integer r| satisfies 

n < N k (7.19) 

n k H - rs E 0 (mod N k ) . (7.20) 


Eqn. (7.20) is a linear congruence for the unknown r) for which there is an, 
explicit solution [~8j , namely, 


where 


13 


H = (rs ’) mod N k 

(7.21) 

s' = (s£ k ) mod N k 

(7.22) 

<KN. ) -1 

C = n. k mod N 

k k k 

(7.23) 


The result we have established for the submatrices of <fi, . trans- 

k-1 


lates as follows for the submatrices of 


k-1" 


The (r,s) submatrix of ft. 


k-1 


13 

(}>(n) is the Euler Totient function defined as the number of integers not 
exceeding, and relatively prime to, n. 
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is just 


w AE(r ’ s) a 


Applying 


(7.18), 

w AE(r,s) 


we find that 

. 2tt . . 

-l — n(r,s) 
N k 

= e 


Denoting now 



(7.24) 


(7.25) 


we get 

w AE(r,s) = w n(r,s) 
k 


(7.26) 


so that the (r,s) "element" of the compound N^-th order ^ is 

w, n(r,s) a, - w,, (rs,) mod N k a, -w, rs ’n, 

k k k k k k 

This is very similar to the (r,s) element of the DFT matrix (2.4) of order 
with = ft^- The only difference is that s' has now replaced s. This, 
however, is a trivial difference involving only column permutations. To prove 
this, it is sufficient to show that there is a one-to-one correspondence between 
s and s' so that when s goes through the values 0,1, ... ,11^-1, s' goes through 
a permutation of them. This will, indeed, be the case if (i n (7.22)) is 

relatively prime to N^. But this is guaranteed by (7.23) since n^ is rela- 
tively prime to (see (7.3)). 

We conclude that a trivial modification of the DFT tableau of order 
N^, will evaluate the transformation effected by ft^ Specifically, we should 
modify the input square of the N, tableau by permuting the f. rows/columns 

rC 1 

so that the i sequence would be identical with the s' sequence (instead of 
the natural number sequence (s) of the unmodified tableau)- We refer to such 
a tableau as a "modified tableau" and use this term from now on, only with this 
restricted, precise, meaning. 

We illustrate now the tableau modification with k= 1 in our example 

(7.12) 

n l = N 2 N 3 = 15; <KN 1 ) = <j>(8) = 4 

r >1 = 15 4-1 mod 8 = (-1) 3 mod 8=7 


Hence 
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Fig. 19. Input square of the 8-th order modified tableau for example (7.12) 



Fig. 20. Subdivision of the Y matrix for example (7.29) 
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(7.27) 

The modified input square called for by (7.27) is shown in Fig. 19. 

Note that, in this case, the modification involves only sign changes. 

The result we have established may be summarized as follows: 

Submatrix £1 , of Y is a column , 

k-1 ] 

permutation of the compound DFT I 

matrix of order (2.4) with l (7.28) 

0 = £1 . This is valid for all I 

k I 

This differs from (7.4) in the column permutation. But, as we have just 
pointed out, this only means that' in the algorithm for order N (7.1), the 
standard tableaus should be replaced by the modified tableaus. This is the 
minor complication referred to earlier. 

We turn now to a simple example to illustrate the algorithm developed 
here. The example chosen has the following parameters: 

K= 3; N x = 3; N 2 = 2; N 3 = 5; /.N= 30 (7.29) 

/v 

The first step is the scrambling of the input vector f to yield the vector q. 

A 

This and the concurrent scrambling of the output F, transform the DFT matrix 
into the Y matrix of (7.13). Y is now subdivided as indicated schematically in 
Fig. 20 (the indicated "measurements" refer to the number of rows/columns). The 
next step is to regard Y(=f) Q ) as a compound third-order matrix (N^ = 3) and 
apply the third order modified DFT tableau with the tableau's S7 identified with 
of Fig. 20. The application of the tableau is shown in Fig. 21 which utilizes 
the schematic tableau representation introduced in Fig. 18. Phase 1 is shown at 
the top (circled 0 q ) , phase 3 is at the bottom (fi^ in a square) and phase 2 
is all the region in between. Phase 1 requires separation of the input vector 
into its three "components". This is implemented in the following straight- 
forward manner: 





f T 31 





UNSCRAMBLING 

TT 

F f 30 

Fig. 21. Algorithm for DFT of order 30 (example (7.29)) 
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q 0 


q 10 


q 20 

f o = 

q9 J 

; f l = 

q 19 

5 f 2 = 

q 29 


(7.30) 


The outputs of phase 1 are the three 0^ vectors of dimensionality 10. Before con- 
sidering phase 2, we have to introduce the generalized notation we use there 

for the oo.'s. co. for the DFT tableau of order N is referred to now as to. 

xi i,N 

(to. , also appearing in Fig. 21, will be defined later). These constants are 
i>N 

either explicitly listed in the tableaus or are inferred from the convention 

that every 0^ is multiplied by (oo^Sl) . For example, the DFT tableau of 

order 5 states explicitly to, c = Implicitly, we infer to _ = 1. All 

4 0 y ^ 

the (o..’s are. tabulated in Table 4 for convenience. The values there have 
ij 

been computed on a 10-digit calculator. For more precise values, one should 
refer to the exact expressions in the tableaus. 

Returning now to Fig. 21, we find that phase 2 of the third order 
tableau requires the evaluation of = (co^fi^E^. Here we apply again result 

(7.28) which implies (with k = 2) that the transformation ((O^^-j^B^ could 
be evaluated with the modified second order DFT tableau (with the tableau's 
identified with ( (0 i 3 ^ 2 ^ (see 20))# Each of the three 10-dimensional 

0^'s is therefore shown in Fig. 21 as the input to phase 1 of a second order 
tableau. In each of these applications of the second order tableau, phase 1 
yields a pair of 5-dimensional 0 vectors. Consider now the specific 5-dimen- 
sional 0 vector on the extreme left of Fig. 21. It has been generated by 
phase 1 of the tableau implementing the transformation based on the matrix 
(ojQ^f!!^) . Therefore, phase 2 calls for its transformation by the matrix 
^02^03^2 * Fig. 21 shows, instead, the matrix ^02^2* W e have adopted here 
the following somewhat unusual terminology: 

oth_. = co„*(the first multiplier met in moving 

against the arrows in the upper half of 
Fig. 21) (7.31) 

Thus, u.. is only defined with respect to a specific diagram. Furthermore, 
the same symbol may have a different numerical value at a different location 



Table 4 


The DFT Tableau Multipliers (o) ^. 


il 

2 

3 

e 

5 

7 

8 

9 

16 

0 

1 

1 

i 

1 

1 

1 

1 

1 

1 

1 

-1.5 

i 

-1.25 

-1.1666667E-0 

7 . 0710678E-1 

0.5 

1 

Q 


i8. 6602540E-1 

i 

-5. 5901699E-1 

5 . 5854267E-2 

1 

-1. 7364818E-1 

1 

11 



i 

H.5388418E-0 

7 . 3430220E-1 

1 

0.5 

7 . 0710678E-1 

I 




-i3 . 6327126E-1 

-18.7484229E-1 

1 

9. 3969262E-1 

1 

Hi 




-15.8778525E-1 

-15.3396936E-1 

1 

-13.4202014E-1 

-1 . 3065630E-0 






-i4.4095855E-l 

1 

-i8. 6602540E-1 

i 

H 





7 . 9015647E-1 

17.0710678E-1 

-i9. 8480775E-1 

5.4119610E-1 

8 





-i3.4087293E-l 


-i8. 6602540E-1 

1 

9 







7. 6604444E-1. 

17.0710678E-1 

10 







-i6. 4278761E-1 

i7 . 0710678E-1 

11 








i 

12 








i 

13 








-il . 3065630E-0 

14 








17 . 0710678E-1 

15 








-i5 . 4119610E-1 

16 








3 . 8268343E-1 

17 








i9. 2387953E-1 


NOTE: Numbers shown in 3 digits or less are exact. 

All other numbers are in Fortran E-format (3.4E-1 = 3.4 x 10 . 
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in the diagram. For example, we have just seen that on the extreme left, 

£Sq 2 = W 02 W 03' This symbol appears in two other places to the right. The first 
one equals w o2 u) 13 ’ secon< ^ equals (JJ Q2 a) 23 ' 

Returning now to the main line of the argument, each one of the 5- 
dimensional B vectors should now be input to a modified DFT tableau of order 5. 
This time, the 3 "vectors" generated in phase 1 are of dimensionality 1, 
namely, scalars. They are to be multiplied by rn^ft^ = ^5 (ft^ i- s the scalar 
1) . At this point, the multiplications are actually carried out and the numerical 
values of the multipliers are needed. Their determination is straightforward. 
Consider for example the multiplier on the extreme left of Fig. 21 


0 ) 


05 


“05^02 U 05 (i) 02 a) 03 


Similarly, for the multiplier on the extreme right 

U 55 = U) 55 U) 12 = U 55 UJ 12 a) 23 

and in general, the value of the multiplier at a specific node is the product of 
all the (iKj terms (stripped of their circumflex) which one encounters in moving 
from that node up the (inverted) tree structure to the stem. 

The multiplications are indicated by the half-circle, half-square 
shapes stringed along the center line of Fig. 21. (These can be regarded as rep- 
resenting the combined three phases of the DFT algorithm of order 1 with 

ft = w. _) . 
i5 

The 36 terms we get after performing the multiplications comprise 
6 independent groups resulting from the 6 separate applications of the modified 
tableau of order 5. The terms of each of these groups are now combined as pre- 
scribed in phase 3 of the 5-th order tableau to yield six 5-dimensional vectors. Each 

A 

of these is actually a term of the form w-j^ftz^i required to complete the computa- 
tions in the three applications of the second order tableau. These computations now 
yield, as tableau outputs, three 10-dimensional vectors which are identical with 
°f the third order tableau. Combining these as prescribed by phase 3 of 
this tableau yields the 30-dimensional Q vector which is just a scrambled version 

A 

of the desired vector F. 

Fig. 21, though directly applicable to example (7.29) only, is char- 
acteristic of all N values. For larger N, there might be one more level of 
branching in each half of the diagram and the number of branches per node may be 
higher. Otherwise, the structure is identical with that of Fig. 21. 
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VIII . Speed Analysis 

In sections IV-VI we constructed the basic DFT tableaus. In section VII 
we showed how to use them in an efficient algorithm for certain N values. Our pur- 
pose in this section is to determine just how fast the resulting algorithm is and 
present a summary (Table 6) of the pertinent parameters for all orders realizable 
with the constructed tableaus. The basis for these developments is Table 5 which 
presents a summary of the tableau parameters. These have been collected from the 

tableau designations and have been fully explained earlier. The values appearing here 
are in general agreement with Winograd's results 1 ^ (Table 1 of [jQ) . The only dif- 
ference is in the number of multiplications in the tableau of order 9. Winograd uses 
13; we use 11. This means that, using this tableau in the computation of any DFT 
whose order is divisible by 9, will yield a 15% reduction in the number of multipli- 
cations as compared to Winograd's results (see (8.3)). In most of these cases there 
is also a reduction in the number of additions (see (8.9)). 


rs. 

Consider now the algorithm of order N= J I N, as applied to complex 

15 k= l K 

data . For each N^, we read off from Table 5 the corresponding number of complex 

multiplications and complex additions A^. We are interested in two func- 

tions of these variables, namely, the total number of real multiplications and 
the total number of real additions u/ for the overall algorithm realized in the 
order implied in (7.1) (phase 1 of the tableau realized first; phase 1 of 

the N, tableau realized last). With this goal in mind, we turn now to a mathe- 
matical formulation of some of the characteristics of the algorithm structure which 
are quite evident in Fig. 21. 

We note that the output of phase 1 of the N-^ tableau is a set of 
vectors (8^) of dimensionality V^. Each of these now generates (at the output of 
phase 1 of the ^ tableau) M„ vectors of dimensionality and so on. It is 

obvious therefore that 


The total output of phase 1 of all the 

N, consists of 


tableaus of order ... 
k , k 

vectors of dimensionality 


ttW 




> ( 8 . 1 ) 


14 

Our M values, representing the total number of multiplications, should be 
compared to the sum of Winograd's two multiplication columns. 

15 The computation of the number of arithmetic operations for real data is more 

involved and will only be briefly discussed later on. 
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Table 5 

Summary of Basic DFT Tableaus 
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As each one of these vectors is fed to phase 1 of an N fc+1 
number of tableaus of order or, equivalently, 

The number of tableaus of 
k-1 

order is |~| 

1=1 


tableau , 


this is also the 


( 8 . 2 ) 


The simplest application of these results is the determination of the 
number of multiplications. Let ’ be the total number of (complex) scalar mul- 

l\ 

tiplications in phase 2 of the algorithm realized as a cascade of k stages. The 

variables which are multiplied are the 1-dimensional $. 's generated in the last 

1 K 

stage of the cascade (N K ) . From (8.1) we know that there are |~J such terms. 

Hence 





(8.3) 


To get the total number of real multiplications in the overall cascade , we note 

that in each of the counted multiplications, only one of the two factors is complex, 
the other being real or imaginary (see Table 4). Therefore, 

< 8 - 4 > 


We have seen in Section VII that each of the overall multipliers for 
the cascade is a product of K tableau multipliers, one from each tableau of the 
cascade. Since each tableau has, at least, one multiplier whose value is 1 or i 
(m >1 in Table 5), some of the overall multipliers will be 1 or i. A 

consideration of the structure of Fig. 21 shows that the number of such multipliers 


p k ■ n "i 

i=l 


(8.5) 


l^All permissible N values are expressible as N = H2 r (H odd; o £r <A) . 

Using this with the values of m, listed in Table 5, yields P = 2r + 6 

i ’ J K o ,r 
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Therefore, if one accepts the somewhat more complex programming involved in 
the special handling of these P^ trivial multipliers, the DFT can actually 
be computed with the smaller number of real multiplications 

jl= 2 (yKf P K ) (8.6) 

The summary in Table 6 covers both cases ^(8.4), (8.6)^."^ 

We turn now to the additions count. Let be the total number of (com- 

plex) scalar additions in a cascade of K stages. Hence, the corresponding number 
of real additions is 

ot = 2^ (8.7) 

K 

The simplest way to determine is through a recursive argument. Assume that we 

have the result for a cascade of length K— 1 and we add to it one more stage at 
position K. This has two* effects. First, the additions in the first K-l stages 
will now involve vectors whose dimensionalities are N times their previous values. 
Hence, the previous additions are transformed into N ^ 1 additions. To 

this we should add the additions of the last stage. From (8.2), the number of tab- 
leaus in the last stage is j"J^ ^ (8.3). Each such tableau requires 

additions of scalars. Hence, the number of scalar additions in the last stage is 

.y// .A and the total is 
1 K 


One could argue that in a binary machine, multiplication by — is also trivial 
and should be excluded from the multiplication count. If this attitude is 

adopted, then the 9th order tableau will have 3 trivial multiplications (m = 3). 

✓v. 

The effect of this will be quite pronounced for N = 144, reducing from 380 
(Table 6) to 348. 
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~*^C-1 N K^K-1 A K 
tsf/ K .^K_l M k 


= A l ; -*i - M 1 




> ( 8 . 8 ) 


We have added here a recursive rephrasing of (8. A) as well as the initial conditions 
to provide a complete prescription for the simultaneous computation of both and 
(8.8) also yields explicit formulae for ^ . For example 

^4 “ A 1 N 2 N 3 N 4 + 

+ V 2 n 3 n 4+ 


4-M M.A.N.+ 
12 3 4 

+M 1 M 2 M 3 A 4 


(3.9) 


An important fact clearly indicated by (8.9) is that j4 , unlike ^ , is also a 
function of the order of the N^'s com Ptising N. Thus, it is important to find 
the cascade order minimizing ^ . 


With eqns. (8.3)-(8.8) and Table 5 at our disposal, we can now compute 
for any N satisfying (7.1). Table 6 presents the results of such computations 
implemented by a simple computer program which also provides information for the 
selection of the most efficient cascade ordering. This sequence appears 

in the last columns of Table 6. We provide here room for two sequences since, in 
some cases, the same values of are obtained with two different 

sequences. In this case, the choice could be governed by arguments other than 
efficiency. 

Each value appears with a bracketed number to its right. This is the 

°f (7.22). Thus, Table 6 provides both the sequence of tableaus to be realized 
and the permutation required in the input square of each tableau (see discussion 
preceding (7.27). 

Note that adoption of the order prescribed in Table 6 is quite important. 

For example, with N = 240, Table 6 states = 5016 with the prescribed order 
3; 16; 5. If, instead, we adopt the order 5; 16; 3, the number of real additions 
jumps- to 5592 — an increase of 11%. 

We turn now to the two remaining columns of Table 6, namely, G^, R. These 
are the two parameters mentioned in section I and are required to determine the speed 
advantage in a specific system. 
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TABLE 6 

Summary of DFT Algorithm 

time for one real multiplication ., 
time for one real addition ' 


Order 

of 

DFT 


Multiplications by 1 and i, 
included in count 


Multiplications by 1 and i, 
excluded from count 


Number 
of Real 
Addi- 
tions 



Tableau 

Realization Sequences 

W 

Speed-Gain 

Function 

Parameters 

Number 
of Real 
Multipli- 
cations 

” 


Speed-Gain 

Function 

Parameters 

Number of 
Real 

Multipli- 

cations 


k 


k 

N 


G 

CO 

R 



G «, 

A 

R 

A 

M 




1 

2 

3 

4 


B 

2 

3 

4 

2 


1.000 

1.000 

4 


_ 


0 


4 


2(1) 









3 


1.585 

2.000 

6 


2.377 

3.000 

4 


12 


3(1) 









4 


2.000 

2.000 

8 


- 

- 

0 


16 


4(1) 









5 


1.935 

2.833 

12 


2.322 

3.400 

10 


34 


5(1) 









6 


2.585 

3.000 

12 


3.877 

4.500 

8 


35 


3(2) 

2(1) 




2(1) 

3(2) 



7 


2.183 

4.000 

18 


2.456 

4.500 

16 


72 


7(1) 









8 


3.000 

3.250 

16 


12.000 

13.000 

4 


52 


8(1) 









9 


2.594 

4.000 

22 


2.853 

4.400 

20 


88 


9(1) 









10 


2.768 

3.667 

24 


3.322 

4.400 

20 


88 


2(1) 

fffll 








12 


3.585 

4.000 

24 


5.377 

6.000 

16 


96 


3(1) 

Ell 




4(3) 

3(1) 



14 


2.961 

4.778 

36 


3.331 

5.375 

32 


172 


2(1) 

7(4) 








15 


3.256 

4.500 

36 


3.447 

4.765 

34 


162 


3(2) 

5(2) 








16 


3.556 

4.111 

36 


6.400 

7.400 

20 


148 


16(1) 









18 


3.412 

4.818 

44 


3.753 

5.300 

40 


212 


2(1) 

9(5) 








20 


3.602 

4.500 

48 


4.322 

5.400 

40 


216 


4(1) 

5(4) 








21 


3.416 

5.556 

54 


3.548 

5.769 

52 


300 


3(1) 

7(5) 








24 


4.585 

5.250 

48 


6.113 

7.000 

36 


252 


3(2) 

8(3) 




8(3) 

3(2) 



28 


3.739 

5.556 

72 


4.206 

6.250 

64 


400 


4(3) 

7(2) 








30 


4.089 

5.333 

72 


4.330 

5.647 

68 


384 


3(1) 

2(1) 

5(1) 



2(1) 

3(1) 

5(1) 


35 


3.325 

6.167 

108 


3.387 

6.283 

106 


666 


7(3) 

5(3) 








36 


4.230 

5.636 

88 


4.653 

6.200 

80 


496 


4(1) 

9(7) 








40 


4.435 

5.542 

96 


5.069 

6.333 

84 


532 


8(5) 

5(2) 








42 


4.194 

6.333 

108 


4.355 

6.577 

104 


684 


3(2) 

2(1) 

7(6) 



2(1) 

3(2) 

7(6) 


45 


3.744 

6.167 

132 


3.802 

6.262 

130 


814 


9(2) 

5(4) 








48 


4.964 

5.889 

108 


5.828 

6.913 

■ 92 


636 


3(1) 

16(11) 








56 


4.517 

6.528 

144 


4.927 

7.121 

132 


940 


8(7) 

7(1) 








60 


4.922 

6.167 

144 


5.212 

6.529 

136 


888 


3(2) 

4(3) 

5(3) 



4(3) 

3(2) 

5(3) 


63 


3.804 

7.111 

198 


3.843 

7.184 

196 


1408 


9(4) 

7(4) 








70 


3.973 

6.815 

216 


4.048 

6.943 

212 


1472 


2(1) 

7(5) 

5(4) 







72 


5.048 

6.659 

176 


5.417 

7.146 

164 


1172 


8(1) 

9(8) 








80 


4.683 

6.259 

216 


5.058 

6.760 

200 


1352 


16(13) 

5(1) 








84 


4.972 

7.111 

216 


5.163 

7.385 

208 


1536 


3(1) 

4(1) 

7(3) 



4(1) 

3(1) 

7(3) 


90 


4.426 

6.848 

264 


4.494 

6.954 

260 


1808 


2(1) 

9(1) 

5(2) 







105 


4.352 

7.463 

324 


4.379 

7.509 

322 


2418 


3(2) 

7(1) 

5(1) 







112 


4.706 

7.198 

324 


4.951 

7.571 

308 


2332 


16(7) 

7(4) 








120 


5.756 

7.208 

288 


6.006 

7.522 

276 


2076 


3(1) 

8(7) 

5(4) 



8(7) 

3(1) 

5(4) 


126 


4.440 

7.747 

396 


4.485 

7.827 

392 


3068 


2(1) 

9(2) 

7(2) 







140 


4.621 

7.463 

432 


4.708 

7.604 

424 


3224 


4(3) 

7(6) 

5(2) 







144 


5.214 

7.364 

396 


5.434 

7.674 

380 


2916 


16(9) 

9(4) 








168 


5.750 

8.083 

432 


5.914 

8.314 

420 


3492 


3(2) 

8(5) 

7(5) 



8(5) 

3(2) 

7(5) 


180 


5.108 

7.530 

528 


5.187 

7.646 

520 


3976 


4(1) 

9(5) 

5(1) 







210 


5.000 

8.111 

648 


5.031 

8.161 

644 


5256 


3(1) 

2(1) 

7(4) 

5(3) 


2(1) 

3(1) 

7(4) 

5(3) 

240 


5.857 

7.741 

648 


6.005 

7.937 

632 


5016 


3(2) 

16(15) 

5(2) 







252 


5.076 

8.384 

792 


5.128 

8.469 

784 


6640 


4(3) 

9(1) 

7(1) 







280 


5.269 

8.273 

864 


5.343 

8.390 

852 


7148 


8(3) 

7(3) 

5(1) 







315 


4.401 

8.759 

1188 


4.409 

8.774 

1186 


10406 


9(8) 

7(5) 

5(2) 







336 


5.802 

8.580 

972 


5.899 

8.724 

956 


8340 


3(1) 

16(13) 

7(6) 







360 


5.790 

8.383 

1056 


5.856 

8.479 

1044 


8852 


8(5) 

9(7) 

5(3) 







420 


5.648 

8.759 

1296 


5.683 

8.814 

1288 


11352 


3(2) 

4(1) 

7(2) 

5(4) 


4(1) 

3(2) 

7(2) 

5(4) 

504 


5.713 

9.179 

1584 


5.756 

9.249 

1572 




8(7) 

9(5) 

7(4) 







560 


5.260 

8.831 

1944 


5.303 

8.905 

1928 




16(11) 

7(5) 

5(3) 







630 


4.931 

9.290 

2376 


4.940 

9.305 

2372 




2(1) 

9(4) 

7(6) 

5(1) 






720 


5.753 

8.970 

2376 


5.792 

9.031 

2360 


WKttm 


16(5) 

9(8) 

5(4) 







840 


6.296 

9.569 

2592 


6.326 

9.614 

2580 


24804 


3(1) 

8(1) 

7(1) 

5(2) 


8(1) 

3(1) 

7(1) 

5(2) 

1008 


5.644 

9.727 

3564 


5.669 

9.771 

3548 


34668 


16(15) 

9(7) 

7(2) 







1260 


5.462 

9.820 

4752 


5.471 

9.836 

4744 


46664 


4(3) 

9(2) 

7(3) 

5(3) 






1680 


6.173 

9.984 

5832 


6.190 

10.011 

5816 




3(2) 

16(9) 

7(4) 

5(1) 






2520 


5.992 

10.483 

9504 


6.000 

10.496 

9492 


Hull 


8(3) 

9(1) 

7(5) 

5(4) 






5040 

J 

5.798 

10.939 

21384 


5.802 

10.948 

21368 


233928 


16(3) 

9(5) 

7(6) 

5(2) 
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_ time for one real multiplication 
time for one real addition 


in the specific system considered. We define the gain G, of the present algorithm 
over the (nominal) Cooley— Tukey algorithm by 


h ^ CT +^/ GT 

(j — 

where ‘ y ^ CT *^ CT are the Cooley-Tukey parameters introduced in (1.1). 


( 8 . 10 ) 

Obviously, G 


is the ratio of the time required by the Cooley-Tukey algorithm to the time required 
by Winograd s algorithm. We refer to it as the speed gain. It is a function of y 
and the four parameters appearing in (8.10) . Eqn. (8.11) is a more convenient two- 
parameter formulation (see ( 1 . 1 )). 


where 


G(y) =g ot 


y+1.5 

y+R 


*^CT 


( 8 . 11 ) 

( 8 . 12 ) 



(8.13) 


is the asymptotic speed gain that is approached with large y. R prescribes a 

pole of the G(y) function (at y = - R) and thus determines the second asymptote. 

Adding to these two the zero of G(y) (at y =-1.5) makes it very easy to sketch 

18 

G(y) and get a sufficiently precise estimate of it. 


We conclude with a few words regarding the real data case. The number of 
arithmetic operations here is not half the number in the corresponding complex data 
case. The main reason for that is that as the DFT of a real vector is, in general, 
complex, some of the intermediate entities will be complex too. For example, the 
8 -th order DFT tableau prescribes 


B 5 = e 2 +ie 5 (8.14) 

and thus 8 ^ is complex even for real data. This means of course that the tableaus 
realizing ^5 = ^5 have complex data inputs. 

Another factor to consider is the fact that the construction of a complex 
number, given its two components, does not involve any arithmetic addition in spite 
of the appearance of the plus symbol. In the example previously cited, both 
and e,. are real when the f_^'s are real. Hence, the "addition" appearing in 
(8.141 is free, 

■*■^[0 eqns, (8,10) - (8.13), replacing ,,/if of (8.4) by jfc of ( 8 . 6 ), yields the 
circumflexed entities G, G^, R appearing in Table 6 . 
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IX . Concluding Remarks 

We have tried to present here an orderly development of Winograd's DFT 
algorithm, starting with the general concept, continuing with the construction of 
the necessary building blocks and culminating in a detailed description of their 
incorporation into the overall algorithm. 

In applying the algorithm, Table 6 (p. 84) is the starting point 
as it lists all the permissible N values with their associated performance param- 
eters. Having chosen a particular N, the next step is to consult Table 5 (p. 80 ) 
in order to locate the specific tableaus called for in Table. 6. In the actual 

implementation, one starts with the scrambling of the input vector and then 
applies phase 1 of the tableaus (in the order prescribed in Table 6) to smaller 
and smaller segments of the data vector in its various partially transformed states. 
This culminates in single component "segments" finally being multiplied by constants 
in phase 2. From here on, the process is reversed in the application of phase 3 
of the tableaus: The scalars appearing at the output of phase 2 are combined into 

vectors of higher and higher dimensionality, finally culminating in an N-dimensional 
vector which is just a scrambled version of the transformed input vector. 

The present paper contains sufficiently detailed information upon which 
one could base a direct, straightforward, implementation of the above process in 
either hardware or software. There are, however, some less obvious implementations 
which have various advantages. These are the subject of a forthcoming paper and 
will not be discussed here. Nevertheless, attention should be called to a certain 
feature of the tableaus specifically designed into them to facilitate the applica- 
tion of these special techniques. We are referring here to "in-place" transforma- 
tion. Consider, for example, the 7-th order tableau (Fig. 8). Note 
that the input components ^2*^5 are usec * to compute c^,c^ and nothing else. 
Hence, there is no need to assign additional storage for C 2 >c^. They may be stored 
back into the f array, overwriting The only requirement is for a tem- 

porary storage for, say, f 2 so that after we store c ^ we still have f 2 avail- 
able for the computation of c,-. Note that even if f 2 represents a vector, we 
still need only a one-word temporary store since the computation is carried out one 
component at a time. 

This property which we have illustrated here with the -*■ (c 2 »c,-) 

transformation is common to all variables in the DFT tableaus of orders 2, 3, 5, 7. 
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It is also valid for the remaining tableaus, if we regard r| as the output vector. The 
only deviation from the above pattern is that, in some cases, groups of 3 components 
(rather than 2) have to be considered. In the above tableau, the computation of 

* ^2 * ^3 suc ^ an exam Pl e * Note however, that, even in this case, a one-word tem- 

porary store is sufficient. 

It should be pointed out that in those applications in which this "in- 
place" feature is not utilized, the tableaus of orders 4,9,8,16, may be simplified 
somewhat by permuting the F^ rows/columns in the output square to yield an 
unscrambled output vector F. In this case, of course, the vector r) may be dis- 

pensed with. 

We turn now to a brief consideration of the precision disadvantage men- 
tioned in section I. We have already seen in Appendix A (last paragraph) that some 
of the manipulations generating the tableaus have a detrimental effect on precision. 

A similar situation afflicts the computation of 6^ in some of the tableaus. 
Examination of eqns. (4.11), (4.12) reveals that the adopted formulation (4.12) 
involves addition and subtraction of Q$^. Hence if >> |6^| we are 

bound to have problems, namely, loss of significant bits in floating point arithmetic 
and tendency to overflow in fixed-point arithmetic. Similar addition-subtraction 
manipulations are dispersed in various disguises throughout the tableaus' deriva- 
tions. 

The effect of these peculiarities of the tableaus is that to guarantee a 
certain measure of precision in the transformation, we probably need more bits per 
word than in the Cooley-Tukey algorithm. We do not analyze this effect here but it 
should be pointed out that the structure of the algorithm as displayed in Fig. 21 
makes such an analysis relatively simple. 

Finally, we conclude with yet another important aspect of the algorithm 
brought forth in Fig. 21, namely, the suitability of its structure to the applica- 
tion of various schemes of parallel processing and pipelining. Indeed, there is 
fertile ground here for all sorts of ingenious designs and variations. As the 
algorithm becomes more widely known, more and more of these will undoubtedly mater- 
ialize. 
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Appendix : Polynomial Congruences 

The derivations in section III require numerous evaluations of 


R(x) = S n (x) mod m(x) 


(A.l) 


where 


S n (x) " T.V 
k=0 


(A. 2) 


and m(x) is a monic polynomial of degree 1 or 2, whose roots lie on the unit 
circle. We establish here all the needed results. 

1_. m(x) = x—Xq (Xq = +1 ) 

R(x) must be of degree 0 


R(x) = r r 


(A. 3) 


and (A.l) is equivalent in this case to 


S n (x) = (x-x 0 )Q n _ 1 (x) + r Q 


(A. 4) 


where Q n _^(x) i s a polynomial of degree (n-1) . Substitutiong x = x^ in (A. 4) we 


R(x) = r Q = S n (x Q ) 


(A. 5) 


Note that with Xq = +1, r^ is a multiplication -free algebraic sum of the coef- 
ficients of S (x) . 

n 

2 _ 

2 . m(x) = x -2x cos9 +1= (x-Xq) (x-x ^) 

R(x) must be of degree 1 


R(x) = r Q + r^x 


and (A.l) is equivalent to 


(A. 6) 


S n (x) = (x-x 0 )(x-x 0 )Q n _ 2 (x)+(r Q +r 1 x) 


(A. 7) 


Hence 


S n (x 0> = r 0 +r l x 0 

S n (x 0 } = r 0 +r l*0 


(A. 8) 
(A. 9) 


Note that 


Xq = cos0 + i sin0 = e 
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Hence, subtracting (A. 9) from (A. 8) yields 


2ir lS in0 = = XX^ 9-6 ik9) 


k=0 


’ ' r l = ¥in9 >®k Slnk0 




n 


r i ‘ 0) 

k=l 


(A. 11) 


where U m (x) is the Chebyshev polynomial of the second kind. 


To get r^ ? we multiply (A. 8) by Xq and (A. 9) by x^ and then subtract, 


getting 


n 


2i r Q sine = x Q S n (x 0 )-x 0 S n (x 0 ) = £ ^(e l(k 1)0 -e l(k 1)0 ) 


k=0 


n 


= 2iSgSin0 -2i^ ^ s^ sin(k-l)9 


k=2 


- =0 - E \ \-2 <cos e) 


(A. 12) 


The specific m(x) polynomials for which R(x) is required are listed in Table 

A1 with the corresponding 0 values. Note that for these values of 0, U k (cos0) 

takes only the values 0, +1. Hence, both r^ and r^ are multiplication-free 

algebraic sums of the coefficients of S^Cx). The results for all required degrees 

of S (x) are shown in Table Al. 
n 

A specific application of Table Al is the following: Given 

P(x) mod m(x) = p^x+p^ (A. 13) 

Q(x) mod m(x) = q^x+q^ (A. 14) 

Find 


G(x) = g^x+gQ = (P(x)Q(x)} mod m(x) 

G(x) can be expressed in terms of (A. 13) (A. 14) as follows: 


(A. 15) 



Table Al: S (x) mod m(x) 

n 


n 


S n (x) = 


s*0 


S, X 

k 
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G(x) ={[P(x) mod m(xQ JTj(x) mod m(x)T] } mod m(x) 

2 

= {(Pjq-^x + (p 1 q Q +P 0 q 1 )x+p 0 q () } mod m(x) (A. 16) 

Identifying the bracketed polynomial with S£(x) in Table Al, we get the following 
results 


G(x) = JPi^ 0 +q l )+P O q l] X+ (p 0 q 0" P l q l ) 
G(x) = (p 1 q 0 +p 0 q 1 )x+ (p 0 q 0 -p iq;L ). 

G(x) = [Pi< q 0 - q i )+p o q i] x+(p o q o" p i q i ) 


2 

(m(x) - x -x+ 1 ) 
(m(x) = x 2 +l) 
(m(x) = x 2 +x+l) 


(A. 17) 
(A. 18) 
(A. 19) 


Starting with p^.q^, each of these formulae requires 4 multiplications. However, 

with the proper rearranging of terms, we can replace one of these multiplications 

with an extra addition, thus getting faster computation. This is accomplished by 

adding and subtracting Pq9q from the g^ term. This is sufficient for (A. 17), 

(A. 19). In (A. 18) we also modify the gg term by adding and subtracting Pgq^. 

The results are summarized in Table A2 and, as we see there, 3 multiplications 

2 

are now sufficient. The x +1 case, however, seems to indicate that the price 
is higher than previously stated, namely, 3 extra additions. In general this is 
indeed true. However, we intend to apply this result to a situation where arith- 
metic operations involving Pg>P]^ only, 80 not count (precomputation). Under 
these conditions, the price is indeed 1 extra addition. 

It should be noted that the higher speed realized by the formulae of Table 
A2 is accompanied by the disadvantage of requiring more bits per word. In fixed 
point arithmetic we will need more bits to prevent overflow of the intermediate 
results. In floating point arithmetic, we will need more bits to prevent loss of 
precision. Consider the extreme case in which Pq^o >> ®1’ (A* 17) would not be 

affected by that but, in floating point, Table A2 could yield a value for g^ 
which would be pure noise. 



m(x) 


{P(x)Q(x)} mod m(x) 


x 2 -x+l 

x 2 +l 

x 2 +x+l 


pPi+Po) ( £ ii +c io ) “ p o q o| x+ (p o q o" p i q i ) 
[ (p r p o )q o +(q i +q o )p 3 x+ (q l +q 0 )p 0 
[ (p r p o } + p o q 3 x+ (p O q O _P l q l 


- (p l+ p 0 )q l 

) 
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